How can AIOps help you prevent the next major incident.

What is it?

AIOps is a term that has been used in the last few years to describe the ability to drive intelligence from the day-to-day data that IT operations generate. The data source could vary from monitoring tools like SolarWinds to service desk tools like ServiceNow to automation tools like configuration management ( chef, puppet … ), or log search platforms like Splunk

Untitled Diagram (5)

One area where AIOps can be an asset to operation teams is incident predictability and remediation, there are others like storage and capacity management, resources utilization …

How can AIOPS help prevent the next outage :

the footprint of digital systems and businesses is increasing every day and so is the speed at which the data is produced.

For example, a Palo Alto firewall can produce up to 12 million events in one day, the manual correlation of data is nearly impossible, and that’s why we need an overview of the entire landscape of data produced by IT operations,  transformation of data to be able to serve as training and test sets for machine learning.

Starting from the promise that an incident is a result of a change ( voluntary or involuntary) to a configuration, a device, a network, or an application, all these changes if monitored and reported on correctly can help create a good context to understand the root-cause analysis of the incident.

You can create an ML model that will help you predict the next outage, notify operation teams, and help reduce the downtime.

Suppose that you transformed the input data that you gathered from all your sources, organized it into dataset like the one below and used a supervised learning process to create an ML model :

Screen Shot 2018-09-18 at 10.20.52 PM

your model will be able to make predictions of future incidents when fed with real-time input coming from your tools and logs :

Untitled Diagram (7)

over time, with more data, your model will get better at detecting future anomalies, with much more accuracy.

in conclusion

There is a lot of writing out there about AIOps, but the application, in my opinion, is a bit harder.

For different reasons, one being the spectrum of toolset in IT operations is very wide, and two being that the data structures are different from one organization to another, which means that trying to put a generic machine learning process to produce insights, will be at worst impossible and at best will lack accuracy.

For an organization to be able to get intelligent insights from  AIOps, there has to be an internal effort to train your models, because the quality of your future prediction of major incidents will essentially depend on the quality of your training and test sets.




Links :

Deploying Apps and ML models on mesosphere DC/OS

Have you ever thought of your data centers and cloud infrastructure ( private and public ) as one big computer? where you can deploy your applications with a click of a button, without worrying too much about the underlying infrastructure? well … DCOS allows you to manage your infrastructure from a single point, offering you the possibility to run distributed applications, containers, services, jobs while maintaining a certain abstraction from the infrastructure layer, as long as it provides computing, storage, and networking capabilities.

After deploying my ML model on a kubernates Cluster, a lambda function, I will deploy it on a DCOS cluster.

what is DCOS:

DCOS is a datacenter operating system, DC/OS is itself a distributed system, a cluster manager, a container platform, and an operating system.

DC/OS Architecture Layers

DCOS manages the 3 layers of software, platform, and infrastructure.

the dashboard :

Screen Shot 2018-09-03 at 7.43.36 PM

the catalog:

DCOS UI offers a catalog of certified and community packages that the users can install in seconds , like kafka, spark, hadoop, MySQL ..



Deploying Apps and ML models on DCOS :

the application I’m deploying is a web server running the model I created in my previous posts to make predictions.

DCOS relies on an application definition file that looks like this :

app.json :

    "volumes": null,
    "id": "mlpregv3",
    "cmd": "python",
    "instances": 1,
    "cpus": 1,
    "mem": 128,
    "disk": 0,
    "gpus": 0,

    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "mbenachour/dcos-mlpreg:1",
            "forcePullImage": false,
            "privileged": false,
            "network": "HOST",
            "portMappings": [
                { "containerPort": 8088, "hostPort": 8088 }


the rest of the code can be found in my GitHub repo

after you configure your DCOS CLI and log in, you can run this command :

Screen Shot 2018-09-03 at 8.01.37 PM

if we take a look at the UI we can see that app/web server has been deployed :

Screen Shot 2018-09-03 at 8.03.35 PM

Blog at

Up ↑