What is it?
AIOps is a term that has been used over the last few years to describe the ability to derive intelligence from the day-to-day data that IT operations generate. The data sources vary, from monitoring tools like SolarWinds and service desk tools like ServiceNow to automation tools such as configuration management (Chef, Puppet …) and log search platforms like Splunk.
One area where AIOps can be an asset to operations teams is incident prediction and remediation. There are others, like storage and capacity management, resource utilization …
How can AIOps help prevent the next outage?
The footprint of digital systems and businesses is increasing every day, and so is the speed at which data is produced.
For example, a Palo Alto firewall can produce up to 12 million events in one day, which makes manual correlation of the data nearly impossible. That is why we need an overview of the entire landscape of data produced by IT operations, and why that data has to be transformed so that it can serve as training and test sets for machine learning.
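As a rough illustration of that transformation step, here is a minimal Python sketch that collapses raw events into one feature row per hour and then splits the rows chronologically into training and test sets. The event tuples, field names, and the 75/25 split are assumptions made up for this example, not the output format of any particular tool.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events: (timestamp, source, severity).
# Real field names and formats will differ from tool to tool.
RAW_EVENTS = [
    ("2024-01-01T00:05:00", "firewall", "warning"),
    ("2024-01-01T00:20:00", "firewall", "critical"),
    ("2024-01-01T01:10:00", "monitoring", "warning"),
    ("2024-01-01T02:45:00", "firewall", "warning"),
    ("2024-01-01T03:30:00", "service_desk", "critical"),
]

def hourly_features(events):
    """Collapse raw events into one row per hour, counting events per source and severity."""
    buckets = {}
    for ts, source, severity in events:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, Counter())[f"{source}_{severity}"] += 1
    return [{"hour": hour, **counts} for hour, counts in sorted(buckets.items())]

def chronological_split(rows, train_fraction=0.75):
    """Keep the older rows for training and the newer rows for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = hourly_features(RAW_EVENTS)
train_rows, test_rows = chronological_split(rows)
```

Splitting by time rather than at random is a design choice here: the test set then plays the role of "future" data that the model has not seen.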
Start from the premise that an incident is the result of a change (voluntary or involuntary) to a configuration, a device, a network, or an application. If all these changes are monitored and reported on correctly, they can help create good context for understanding the root cause of the incident.
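To make the idea concrete, the sketch below lists, for each incident, the changes made to the same target shortly before it. The record layout, the IDs, and the two-hour lookback window are hypothetical choices for illustration only; a real setup would pull these records from its own change and incident tools.

```python
from datetime import datetime, timedelta

# Hypothetical change and incident records (IDs, targets, and times are made up).
CHANGES = [
    {"id": "CHG-1", "target": "fw-01", "time": datetime(2024, 1, 1, 9, 0)},
    {"id": "CHG-2", "target": "app-01", "time": datetime(2024, 1, 1, 9, 30)},
    {"id": "CHG-3", "target": "fw-01", "time": datetime(2024, 1, 1, 13, 30)},
]
INCIDENTS = [
    {"id": "INC-1", "target": "fw-01", "time": datetime(2024, 1, 1, 10, 15)},
]

def candidate_changes(incident, changes, lookback=timedelta(hours=2)):
    """Return changes to the same target made within `lookback` before the incident."""
    window_start = incident["time"] - lookback
    return [
        change for change in changes
        if change["target"] == incident["target"]
        and window_start <= change["time"] <= incident["time"]
    ]

for incident in INCIDENTS:
    suspects = candidate_changes(incident, CHANGES)
    print(incident["id"], [change["id"] for change in suspects])
```

A list of candidate changes is only context for a root-cause investigation, not proof that a given change caused the incident.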
You can create an ML model that will help you predict the next outage, notify operations teams, and reduce downtime.
Suppose that you transformed the input data gathered from all your sources, organized it into a labeled dataset, and used a supervised learning process to create an ML model.
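For readers who want to see what such a supervised learning step could look like, here is a minimal sketch that fits a logistic regression with plain stochastic gradient descent, using only the Python standard library. The feature columns (config changes, error events, CPU load), the eight rows, and the hyperparameters are invented for illustration; in practice you would more likely reach for an established ML library and far more data.

```python
import math

# Hypothetical labeled dataset. Each row is (config_changes, error_events, cpu_load)
# for one time window; the label is 1 if an incident followed, else 0.
X = [
    (0, 1, 0.20), (1, 0, 0.30), (0, 2, 0.25), (1, 1, 0.35),
    (4, 9, 0.90), (3, 7, 0.85), (5, 8, 0.95), (4, 6, 0.80),
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Estimated probability that an incident follows this window."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))

def train(X, y, learning_rate=0.1, epochs=500):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * f for w, f in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

weights, bias = train(X, y)
```

Logistic regression is used here only because it is small enough to write out in full; the choice of algorithm for real incident data is a separate decision.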
Your model will then be able to predict future incidents when fed real-time input coming from your tools and logs.
Over time, and with more data, your model will get better at detecting future anomalies, with much more accuracy.
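A minimal sketch of that scoring loop might look like the following. The `MODEL` weights are a hypothetical stand-in for a trained model, and the feature names, the 0.8 alert threshold, and the `notify` callback are all assumptions for the example; a real pipeline would read from your monitoring and log tools and send alerts through your own notification channel.

```python
import math

# Hypothetical stand-in for a trained model: one weight per feature plus a bias.
MODEL = {"bias": -4.0, "weights": {"config_changes": 0.6, "error_events": 0.5}}
ALERT_THRESHOLD = 0.8  # assumed cut-off, to be tuned per organization

def incident_probability(model, observation):
    """Logistic score for one observation; features missing from it count as 0."""
    z = model["bias"] + sum(
        weight * observation.get(name, 0.0)
        for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

def score_stream(model, observations, notify):
    """Score each incoming observation and notify when the risk crosses the threshold."""
    for observation in observations:
        probability = incident_probability(model, observation)
        if probability >= ALERT_THRESHOLD:
            notify(observation["host"], probability)

alerts = []
live_feed = [
    {"host": "fw-01", "config_changes": 0, "error_events": 1},
    {"host": "app-01", "config_changes": 4, "error_events": 9},
]
score_stream(MODEL, live_feed, lambda host, p: alerts.append((host, round(p, 2))))
```

Here the callback just collects alerts in a list; swapping it for a function that pages the operations team is where the notification part would plug in.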
There is a lot of writing out there about AIOps, but applying it is, in my opinion, a bit harder.
There are different reasons for this. One is that the spectrum of toolsets in IT operations is very wide. Another is that data structures differ from one organization to another, which means that a generic machine learning process for producing insights will be impossible at worst and will lack accuracy at best.
For an organization to get intelligent insights from AIOps, there has to be an internal effort to train its own models, because the quality of future predictions of major incidents will essentially depend on the quality of the training and test sets.
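One simple way to check that quality is to compare the model's predictions against a held-out test set. The sketch below computes precision (how many predicted incidents were real) and recall (how many real incidents were predicted); the label lists are made up for illustration.

```python
def precision_recall(actual, predicted):
    """Precision and recall for binary labels, where 1 means 'incident'."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical held-out test labels and the model's predictions for the same windows.
actual = [0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 1, 1, 0, 0, 1, 0, 0]

precision, recall = precision_recall(actual, predicted)
```

Low precision means the operations team gets paged for incidents that never happen, while low recall means real outages go unpredicted, so both numbers are worth tracking as the training data grows.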