Power generation prediction using realtime inference with Spark Structured Streaming and Kafka

A modern power plants generate an enormous amount of data, with the ever growing number of high frequency sensors, some of the applications of this data are : the protection against problems induced by combustion dynamics for example, and help improve the plant heat rate, and optimize power generation.

If we take a sensor that monitors the combustion process at a frequency of 25kHz it could generate up to 10Gig bytes of data a day, scale that to hundreds ( sometimes thousands ) and you will end up with something in the order of terabytes a day.

In the energy industry, independent power producers with big fleet of generation units are gathering massive quantities of data from their plants, whether it’s about reporting and dashboarding, or making recommendation in realtime to improve the generation, or simply alerting about a potential issue, realtime processing of streaming data is and should be an integral part of this data pipeline.

Kappa vs Lambda Architecture :

generally there are two approaches to realtime streaming, lambda and kappa architectures

lambda architecture :

Lambda architecture is a way of processing massive quantities of data (i.e. “Big Data”) that provides access to batch-processing and stream-processing methods with a hybrid approach. Lambda architecture is used to solve the problem of computing arbitrary functions. The lambda architecture itself is composed of 3 layers : Batch, Speed, and Serving.

batch layer manages historical data and handles all the computation and transformations on the data including machine learning inference

the speed layer is supposed to handle all low-latency and realtime queries and

Kappa Architecture :

The Kappa Architecture is used for processing streaming data. The main premise behind the Kappa Architecture is that you can perform both real-time and batch processing, especially for analytics, with a single technology stack. It is based on a streaming architecture in which an incoming series of data is first stored in a messaging engine like Apache Kafka. From there, a stream processing engine will read the data and transform it into an analyzable format, and then store it into an analytics database for end users to query.

Both architectures are also useful for addressing “human fault tolerance,” in which problems with the processing code (either bugs or just known limitations) can be overcome by updating the code and running it again on the historical data. The main difference with the Kappa Architecture is that all data is treated as if it were a stream, so the stream processing engine acts as the sole data transformation engine.

Example :

if you want to take a look at the code first, here are the links to the notebook and kafka producer code.

databricks notebook

Code : git repo

To to demonstrate realtime processing I will deploy a linear regression model that predicts the power generation in MWs based on sensors data.

I will be using the following Combined Cycle Power Plant Data Set , the dataset collects 9568 datapoints over 5 years ( 2006 to 2011) and these columns :

Features consist of hourly average ambient variables

  • Temperature (T) in the range 1.81°C and 37.11°C,
  • Ambient Pressure (AP) in the range 992.89-1033.30 milibar,
  • Relative Humidity (RH) in the range 25.56% to 100.16%
  • Exhaust Vacuum (V) in teh range 25.36-81.56 cm Hg
  • Net hourly electrical energy output (EP) 420.26-495.76 MW

I will use these frameworks :

Kafka ( confluent ) : as an upstream data source

Spark Structured Streaming as processing engine

Sparl.Ml to train the model

Data Exploration :

Histogram of all features and labels

Training :

After downloading the data, we select the features columns, we create an assembler Vector to gather all the features, then split the data to test and train, create Linear Regression – using featuresCOl and LabelCol

after the model is trained, we measure Model metrics ( MAE. RMSE, R2 )

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

#After downloading the data, we select the features columns
feature_columns = df.columns[:-1]

# Create an assembler Vector
assembler = VectorAssembler(inputCols=feature_columns,outputCol="features")
df2 = assembler.transform(df)
df_train = df2.select("features","PE")

# Split the data to test and train
train, test = df_train.randomSplit([0.7, 0.3])

# Create Linear Regression - using featuresCOl and LabelCol
lr = LinearRegression(featuresCol="features", labelCol="PE")
model = lr.fit(train)

#Measure Model metrics
evaluation_summary = model.evaluate(test)


Prediction on test set :

the transform method creates an extra column called prediction.

Predictions on streaming data

Power plants like combined cycle gas plants or coal or nuclear tend to generate an enormous amount of sensor data.

The Lambda architecture approach is to collect, process, clean the data and store it in a datalake waiting for the next batch to run the inference on it, the second approach (kappa architecture) is to run the model on the data while as it’s streaming. In this example we will use the following these main frameworks to run realtime predictions :

  • Kafka : a data streaming framework allowing producers to write the data upstream and consumer to read downstream
  • Spark SQL : Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
  • Spark Structured Streaming : scalable fault tolerant streaming processing engine built on top of SparkSQL provides the possibility to run transformation and ML models on streaming data

Here is the code to initiate a read stream from a kafka streaming cluster running on Host : plc-4nyp6.us-east-1.aws.confluent.cloud:9092

the ReadStream() method create a dynamic dataframe, that will hold values captured from kafka stream over time.

dfs = spark.readStream .format("kafka") .option("includeHeaders", "true") .option("kafka.bootstrap.servers", "plc-4nyp6.us-east-1.aws.confluent.cloud:9092") .option("subscribe", "ccgt_sensors") .option("startingOffsets", "latest") .option("kafka.security.protocol","SASL_SSL") .option("kafka.sasl.mechanism", "PLAIN") .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"HXKG22IOPJ2RUDEJ\" password=\"zZLvDdDcBYFkolkaQJdvxrfyUuU/ju/egbgxD3+1GqH5g5mQ5ShhQ0PLh59sSv+xG\";") .load()
dfs2 = dfs.selectExpr("CAST(value AS STRING)")

this will read stream of input values as a string :

at this point we need to split the string to an array of strings based on the comma delimiter , and cast the list as Array<double>

Since model expect a vector of values to run the transformation ( prediction ) , I created a UDF (User Defined Function) to convert list of doubles to a vector using Vectors.dense() method

#cast string inouts as an array of doubles 
values = dfs2.select(
       split(dfs2.value, ",").cast("array<Double>").alias("inputs")

# create UDF to convert features list to   
conv_vec = udf(lambda vs: Vectors.dense([float(i) for i in vs]), VectorUDT())

now that we have stream input data as a vector we can run predictions :

def predict_mw(sdf):
  mw = model.transform(sdf) 
  return mw

df_out = predict_mw(new_df)


the result is a realtime prediction of MWs generated by the power plant based on the input features

Deploy PJM Electricity Load Forecast Model on AWS SageMaker

in the energy industry, forecasting the grid load is vital for various commercial optimizations around Day Ahead and Real Time trading, but it also help Independent Power Providers (IPPs) allocate the right generation unites.

we talked previously about energy markets, CAISO, PJM, ERCOT , and others. In this article the goal is not to talk about the accuracy of the model in predicting load, but to highlight AWS SageMaker way of deploying ML models .

The Dataset :

You can use PJM tool dataminer to extract the load , dataminer is PJM’s enhanced data management tool, giving members and non-members easier, faster and more reliable access to public data formerly posted on pjm.com.

PJM dataMiner

the initial dataset will look something like :

to simplify this example we can delete the middle two columns to just keep datetime and load in MW

Splitting Training and Test Data

we will split data into training and test

Training Data : 6578

Test Data: 2021

Feature engineering

One transformation that will make on data is to create new features from datetime field :

  • hour of the day
  • day of the week
  • month
  • quarter
  • year
  • day of the month
  • day of the year
  • week of the year
# Create features from datetime index
def create_features(df, label=None):
    df['date'] = df.index
    df['hour'] = df['date'].dt.hour
    df['dayofweek'] = df['date'].dt.dayofweek
    df['month'] = df['date'].dt.month
    df['quarter'] = df['date'].dt.quarter
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    df['weekofyear'] = df['date'].dt.weekofyear

Building Model with XGBoost

what is XGBoost :

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way.

XGBoost algorithms are supervised machine learning that has a classifier and a regressor implementation

in this instance we will use xgboost regressor for predictions :

reg = xgb.XGBRegressor(n_estimators=1000)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose = True)

Results of prediction

AWS Endpoint Deployment

to be able to deploy the model on AWS we must :

  • save the model in s3
  • create a container with model file
  • create a sagemaker Endpoint configuration
  • create sagemaker Endpoint

Saving Model to s3

region = 'us-east-1'
bucket = "pjm-load-forecast"
prefix = 'sagemaker/pjm-forecast-xgboost-byo'
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)

fObj = open("model.tar.gz", 'rb')
key= os.path.join(prefix, model_file_name, 'model.tar.gz')

Creating Model / Endpoint configuration / Endpoints

importing XGBoost image

import sagemaker
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "1.2-1")

creating Model

primary_container = {
    'Image': container,
    'ModelDataUrl': model_url,
role = get_execution_role()
create_model_response2 = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

Once created the models are saved to AWS Sagemaker

Create endpoint configuration

endpoint_config_name = 'PJM-LoadForecast-XGBoostEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,

Same goes for Endpoint configuration, once created you can view them on AWS console

Deploy Endpoint

endpoint_name = 'PJM-LoadForecast-XGBoostEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = sm_client.create_endpoint(


SageMaker allow you to validate your endpoint , here a code snippet for that:

file_name = 'test_point.csv' 
with open(file_name, 'r') as f:
    payload = f.read().strip()
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, 
result = response['Body'].read().decode('ascii')

Notebook : you can find the notebook for this article on github


AWS Sagemaker has built-in algorithms for both supervised and unsupervised machine learning models. it provides a great platform for training, and deploying machine learning models into a production environment on AWS. By combining this powerful platform with the serverless capabilities of Amazon Simple Storage Service (S3), Amazon API Gateway, and AWS Lambda, it’s possible to transform an Amazon SageMaker endpoint into a web application that accepts new input data, potentially from a variety of sources, and presents the resulting inferences to an end user.

Predicting Electricity Wholesale Prices using AWS Machine Learning

Electricity Wholesale Markets

Most of the nation’s wholesale electricity sales happen in a competitive market managed by Independent System Operator (ISO), with over 200 million customers in these areas and over $120 billion in annual energy transactions taking place. Under the Federal Power Act, these markets are overseen by the Federal Energy Regulatory Commission (FERC), which ultimately determines the guidelines for how wholesale electricity is bought and sold in the marketplace. RTOs/ISOs create the market rules that enforce whether and how energy resources can compete. Wholesale markets should allow all resources to compete on price and performance, as the Federal Power Act requires that the rates, terms, and conditions of service governing wholesale competitive markets be “just and reasonable”.



Predicting Energy Prices

In a competitive market, being able to predict energy prices does give IPPs a great advantage in formulating their bids for the day ahead market.

The DA is a purely financial market, which means even financial institutions who are not a power producer can hedge and make profit from buying/selling bulk energy in DA market.

Energy Price Forecasting ( EPF ) has become very instrumental in the decision making process for day to day energy bids, but also for creating a point of view (POV) for long term investments

There multiple models and methods used in creating price forecast for electricity :

  • Multi-agent model : build prices forecasting based on matching demand to supply, by simulation the operations of heterogeneous system of generation units and companies
  • Fundamentals model : which focuses on simulation the physical and economic relationship influencing the trading of electricity such as : weather, fleet conditions, load ..
  • Statistical model : uses mathematical regression models which is basically a combination of previous day/month/year prices and other input variables like weather or load
  • Computational intelegence model : this class of models uses mainly deep neural network, or support victor machine methods to predict prices , these methods are good at covering the non-linear aspect of a price curve, from a price spike point of view.
  • Hybrid model: which is a combination of two or more of the above models.


We will use the Computation intelligence model based on Fritz Arnold proposal for this explanatory exercise.

Proposal :

In this paper, the author explores different approaches to create an accurate prediction of energy wholesale prices, one of them which will make the subject of this article is to use purely time series data from previous years to learn about long (one year ) and short ( one day ) term patterns of prices


Historical hourly Day-Ahead market clearing prices for the German bidding zone.

Platform :

We will use Amazon SageMaker as our data science and machine learning platform :

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.

Traditional ML development is a complex, expensive, iterative process made even harder because there are no integrated tools for the entire machine learning workflow. You need to stitch together tools and workflows, which is time-consuming and error-prone. SageMaker solves this challenge by providing all of the components used for machine learning in a single toolset so models get to production faster with much less effort and at lower cost.

Platform Architecture :



SageMaker Studio unifies at last all the tools needed for ML development. Developers can write code, track experiments, visualize data, and perform debugging and monitoring all within a single, integrated visual interface, which significantly boosts developer productivity.



Timeseries analysis

loading the data: we read data input from csv file, which consists of historical prices

Screen Shot 2020-04-08 at 11.14.53 PM

Visualizing historical: 10 years od DA prices

Screen Shot 2020-04-08 at 11.25.06 PM

Model Architecture :

This model is hybrid and will use a combination of one convolutional neural network, with a kernel and a stride of 24 corresponding to the number of hours in the day, and a Long Short Term Memory deep neural network, to help the model learn about price pattern over long and short periods of time.



Screen Shot 2020-04-08 at 11.51.01 PM



For one year :

Screen Shot 2020-04-08 at 11.57.06 PM

For 2 weeks :

Screen Shot 2020-04-08 at 11.58.08 PM

Deploying the Model:

Once the model is trained and saved, we can use AWS endpoints to deploy as an API.

API backed deployment basically wraps the model in a web application to make available for inferences.


A EPF computational intelligence model that is based only on timeseries analysis does provide good results when it comes to learning the general patterns of prices, seasonality of surges, quarterly, and monthly patterns as well, weekday vs weekend, but it doesn’t perform well when it comes daily price surges which are critical to energy traders, as these impulses are not very frequent but when they happen they offer a great revenue opportunity for the Real Time market.

The rest of the project does provide a solution to help mitigate this issue, which consists of using MLP (MultiLayer perceptron ) to be able to feed the model different inputs like  weather, load, etc … and predict prices. The accurracy of the model improves when it comes to predicting price spikes, but as stated above, the computational intelligence models are not probabilistic in nature.


Using Smart Meter Data for consumer electricity usage forecasting

Electric energy consumption is essential for promoting economic development and raising the standard of living. In contrast to other energy sources, electric energy cannot be stored for large-scale consumption. From an economic viewpoint, the supply and demand of electric energy must be balanced at any given time. Therefore, a precise forecasting of electric energy consumption is very important for the economic operation of an electric power grid.

The ability to create a forecasting model for an individual consumer can help determine the overall load on the grid for a given time.

For this post, we will classify this as a time series prediction problem and only use one variance.  Check back for a post where introduce mutli variance.

The Dataset :

I used Smart Meters Texas or SMT portal to access my home electricity usage. SMTstores daily, monthly and even 15-minute intervals of energy data. Energy data is  recorded by digital electric meters (commonly known as smart meters), and provides secure access to the data for customers and authorized market participants

In addition to acting as an interface for access to smart meter data, SMT enables secure communications with the customers in-home devices and provides a convenient and easy-to-use process where customers can voluntarily authorize market participants, other than the customer’s retail electric provider or third parties, access to their energy information and in-home devices.

Tools :

Keras: Keras is an open source neural network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible

TensorFlow : TensorFlow is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library and is also used for machine learning applications such as neural networks

The Algorithm: LSTM

In our case, we will be using a variant of Recurrent Neural Network ( RNN ) called Long Short Term Memory ( LSTM), why? time series problems are a difficult type of prediction modeling, LSTMs are good at extracting input features patterns when it’s over a long period of time, LSTM have the capability of retaining it in memory, and use it to predict next sequences, this what my 3 months usage (in KWH) look like ( by 15 min interval ) :

Screen Shot 2019-02-10 at 2.31.38 PM

for this example, I will only use a subset of the overall dataset, 3 days of electricity usage:


11/01/18 00:15,0.005
11/01/18 00:30,0.005
11/01/18 00:45,0.013
11/01/18 01:00,0.029
11/01/18 01:15,0.025
11/01/18 01:30,0.004
11/01/18 01:45,0.005
11/01/18 02:00,0.004
11/01/18 02:15,0.024


Screen Shot 2019-02-10 at 2.44.35 PM

python :

loading the dataset:


# load the dataset

dataframe = read_csv(‘AMS.csv’, usecols=[1], engine=’python’)

dataset = dataframe.values

dataset = dataset.astype(‘float32’)


Training Set and Test Set :


train_size = int(len(dataset) * 0.67)

test_size = len(dataset) – train_size

train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]


creating model :


model = Sequential()

model.add(LSTM(4, input_shape=(1, look_back))) model.add(Dense(1)) model.add(Activation(‘tanh’)) model.compile(loss=’mean_squared_error’, optimizer=’adam’) model.fit(trainX, trainY, epochs=50, batch_size=1) [/code]

making predictions:


trainPredict = model.predict(trainX)

testPredict = model.predict(testX) [/code]

for plotting we shift the training and test sets :

# shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict

if results were plotted, it would look something like the picture below:

Blue: actual usage

Orange: Test Set

Green: Predictions

Screen Shot 2019-02-11 at 8.27.33 AM

if we zoom on the prediction part :

Screen Shot 2019-02-11 at 8.34.54 AM

this is good forecasting, with a 0.19 RMSE

Summary :

With the electricity market undergoing a revolution, load forecasts have gained much more significance spreading across other business departments like energy trading and financial planning. Accurate load forecasts are the basis for most reliability organizations operations, like Electricity Reliability Consul Of Texas more commonly known as ERCOT. Accurate load forecasting will become even more important with Smart Grids. Smart Grids create the opportunity to proactively take action at the consumer level, storage level, and generation side, to avoid a situation of energy scarcity and/ or price surge.

My books recommendations for AI and ML

IMG_6567 copy

With AI and machine learning creeping in every industry, the paradigm of IT as we know it is also changing, ML is a natural extension of the software revolution we have seen in the last decades, knowing how to utilize ML in your industry will be a key element for success and growth in the coming years.

This transformation will need a new vision, as new jobs, new platforms and new ways of doing business will emerge from it. I believe at this point we are past the hype of AI and we are in the middle of a reality where machine learning and inference are helping thousands of businesses grow and prosper.

I have read several books on AI and ML, and the two that stands out are:

  • Human + Machine , reimagining work in the age of AI
  • Pragmatic AI : an introduction to Cloud-Based Machine learning.

either you are an engineer, a manager, an executive, or merely driven by curiosity about AI and ML, I recommend that you read these books to fully grasp its impact on many industries

Human + Machine, reimagining work in the age of AI

Paul R. Daugherty and H. James Wilson did an amazing job at reimaging what work will look like in the age of AI, they introduced the notion of the Missing Middle; a realistic approach of looking at this transformation by defining what machine can do, what humans can do, and  where humans and machines have hybrid activities.

Humans can judge, lead, empathize and create, machines can iterate, predict and adapt.

AI can give humans superpowers, but humans need to train, sustain machines, and at times explain its decisions.

Paul and James talk about an entirely new set of jobs that will emerge from this alliance.

Pragmatic AI : an introduction to Cloud-Based Machine learning

if you are an engineer who likes to understand how the training and the inference work under the hood, this book would be a great resource for you.

Pragmatic AI explains how you can utilize cloud resources in AWS Azure, and GCP to train your models, optimize them, and deploy a production scale machine learning powered application.

the book also contains real applications and code samples to help reproduce it on your own, and it covers the following topics :

  • AI and ML toolchain: from python ecosystem toolchains like numpy,  Jupiter Notebooks and others to the tools available on AWS GCP and Azure
  • DevOps practices to help you deliver and deploy
  • Creating practical AI applications from scratch
  • Optimization


there are definitely a lot of publications concerning AI and ML, but the combination of the two books above will cover the organizational and structural challenges that an organization will face when it comes to adopting AI, and also the technical backgrounds needed to work with it.

Storing your ML Models with parameters

Often when training machine learning models you find yourself creating different estimators and tuning this parameter or that to get the results you want, you may also find yourself wanting to save the results of those iterations, to save you time in the future.

that’s what I’m trying to address in this post, having some sort of artifact repository for machine learning models, but saving your parameters as metadata using the following design :

Screen Shot 2018-12-09 at 8.38.33 PM

1: user uploads artifacts using pre-signed s3 URLs

2  and 3: a putObject event triggers the lambda function to make an API call to an ec2 instance running an HTTP server  to read the estimator from S3 and get the parameters

4: saving the parameters in DynamoDB


Uploading artifacts :

I use AWS S3 to store the assets, making use of the pre-signed URL feature that gives you the possibility to use temporary URLs to upload files to S3, which takes away the managing permissions.

to orchestrate all this I like to use my favorite serverless framework.

here is the code on github

Deploying Serverless stack :

[code]$serverless deploy[/code]

Screen Shot 2018-12-10 at 1.19.57 PM

this will create 5 endpoints :

  • POST /dev/asset
  • GET /dev/asset
  • PUT /dev/asset/{asset_id}
  • DELETE /dev/asset/{asset_id}

these endpoints will allow you to update/create/delete an artifact, in this case, is a model.

for more reading about this check out the readme page of this serverless example

Getting the parameters :

in this part, in the EC2 instance, we will try to download the model and get the parameters to store them in dynamo DB

Initially I thought I could leverage all of this work in Lambda, so I don’t have to create an EC2 instance just to read the parameters, unfortunately, there couple issues with that solution, one of them is the size of the dependencies once you add SKlearn libraries as a dependency, the size of the lambda zip reaches 60Mb. But once uploaded there was an issue running SKlearn part of the lambda, for this iteration I decided to use a t2.micro on EC2.

the EC2 has a python web server running that get requests with an asset_id, downloads the asset, get the parameters and store them in dynamodb

this is the code for the server :


Testing the upload :

to test all this I created a small python script:


import sys
import requests
from sklearn.externals import joblib

def upload(filename):
model = loadModel(filename)
print (model.get_params())
url = ‘https://oo0cl2av91.execute-api.us-east-1.amazonaws.com/dev/asset&#8217;
response = requests.post(url)
print (response)
presigned = response.json().get(‘body’).get(‘upload_url’)
response = requests.put(presigned, data=open(filename).read())
print (response)

def loadModel(model_path):
download_path = model_path
#s3_client.download_file(BUCKET_NAME, model, ‘/tmp/model.pkl’)
return joblib.load(download_path)



to run it use :

[code] python test.py  your_model.pkl [/code]

if you look at your dynamodb table you will see that your model has a description :

Screen Shot 2018-12-10 at 11.40.30 PM

Querying Ercot public dataset using AWS Glue and Athena

ERCOT is an acronym for Electric Reliability Council of Texas, it manages the flow of electric power to more than 25 million Texas customers — representing about 90 percent of the state’s electric load. As the independent system operator for the region, ERCOT schedules power on an electric grid that connects more than 46,500 miles of transmission lines and 600+ generation units. It also performs financial settlement for the competitive wholesale bulk-power market and administers retail switching for 7 million premises in competitive choice areas.

ERCOT also offers an online and public dataset giving market participants information on a variety of topics related to the market of electricity in the state of Texas, which makes it a good candidate for AWS products: Glue and Athena.

Tools :

AWS Glue 

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics

AWS Athena 

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run

Scraping Data :

the data on Ercot website is available as a collection of  .zip files, I used a python scraper from this github repository to only collect CSV files.

As an example, we will be collecting data about the total energy sold from this page

Using the previous tools the command would something like this :


python -m ercot.scraper “http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12334&reportTitle=DAM%20Total%20Energy%20Sold&showHTMLView=&mimicKey&#8221;


the script will download the csv files and store them in a data folder :

Screen Shot 2018-11-11 at 9.07.37 PM

At this point, we transfer the data to S3 to be ready for AWS Glue, an optimization of this process could consist of creating a lambda function with a schedule to continuously upload  new datasets

Creating a Crawler 

you can add a Crawler in AWS Glue to be able to traverse datasets in S3 and create a table to be queried.

Screen Shot 2018-11-11 at 9.31.44 PM


At the end of its run, the crawler creates a table that contains records gathered from all the CSV files we downloaded from EROCT public dataset, in this instance the table is called: damtotqtyengysoldnp

Screen Shot 2018-11-11 at 9.26.13 PM


And now you can query Ahead! 

using AWS Athena, you can run different queries on the table we generated previously, here are some few examples :

Total energy sold by settlement point :

Screen Shot 2018-11-11 at 9.53.54 PM


Getting the hours of the day: 11/12/2018 with the Max  energy sold 

Screen Shot 2018-11-11 at 10.07.58 PM



How can AIOps help you prevent the next major incident.

What is it?

AIOps is a term that has been used in the last few years to describe the ability to drive intelligence from the day-to-day data that IT operations generate. The data source could vary from monitoring tools like SolarWinds to service desk tools like ServiceNow to automation tools like configuration management ( chef, puppet … ), or log search platforms like Splunk

Untitled Diagram (5)

One area where AIOps can be an asset to operation teams is incident predictability and remediation, there are others like storage and capacity management, resources utilization …

How can AIOPS help prevent the next outage :

the footprint of digital systems and businesses is increasing every day and so is the speed at which the data is produced.

For example, a Palo Alto firewall can produce up to 12 million events in one day, the manual correlation of data is nearly impossible, and that’s why we need an overview of the entire landscape of data produced by IT operations,  transformation of data to be able to serve as training and test sets for machine learning.

Starting from the promise that an incident is a result of a change ( voluntary or involuntary) to a configuration, a device, a network, or an application, all these changes if monitored and reported on correctly can help create a good context to understand the root-cause analysis of the incident.

You can create an ML model that will help you predict the next outage, notify operation teams, and help reduce the downtime.

Suppose that you transformed the input data that you gathered from all your sources, organized it into dataset like the one below and used a supervised learning process to create an ML model :

Screen Shot 2018-09-18 at 10.20.52 PM

your model will be able to make predictions of future incidents when fed with real-time input coming from your tools and logs :

Untitled Diagram (7)

over time, with more data, your model will get better at detecting future anomalies, with much more accuracy.

in conclusion

There is a lot of writing out there about AIOps, but the application, in my opinion, is a bit harder.

For different reasons, one being the spectrum of toolset in IT operations is very wide, and two being that the data structures are different from one organization to another, which means that trying to put a generic machine learning process to produce insights, will be at worst impossible and at best will lack accuracy.

For an organization to be able to get intelligent insights from  AIOps, there has to be an internal effort to train your models, because the quality of your future prediction of major incidents will essentially depend on the quality of your training and test sets.




Links :




Deploying Apps and ML models on mesosphere DC/OS

Have you ever thought of your data centers and cloud infrastructure ( private and public ) as one big computer? where you can deploy your applications with a click of a button, without worrying too much about the underlying infrastructure? well … DCOS allows you to manage your infrastructure from a single point, offering you the possibility to run distributed applications, containers, services, jobs while maintaining a certain abstraction from the infrastructure layer, as long as it provides computing, storage, and networking capabilities.

After deploying my ML model on a kubernates Cluster, a lambda function, I will deploy it on a DCOS cluster.

what is DCOS:

DCOS is a datacenter operating system, DC/OS is itself a distributed system, a cluster manager, a container platform, and an operating system.

DC/OS Architecture Layers

DCOS manages the 3 layers of software, platform, and infrastructure.

the dashboard :

Screen Shot 2018-09-03 at 7.43.36 PM

the catalog:

DCOS UI offers a catalog of certified and community packages that the users can install in seconds , like kafka, spark, hadoop, MySQL ..



Deploying Apps and ML models on DCOS :

the application I’m deploying is a web server running the model I created in my previous posts to make predictions.

DCOS relies on an application definition file that looks like this :

app.json :

“volumes”: null,
“id”: “mlpregv3”,
“cmd”: “python server.py”,
“instances”: 1,
“cpus”: 1,
“mem”: 128,
“disk”: 0,
“gpus”: 0,

“container”: {
“type”: “DOCKER”,
“docker”: {
“image”: “mbenachour/dcos-mlpreg:1”,
“forcePullImage”: false,
“privileged”: false,
“network”: “HOST”,
“portMappings”: [
{ “containerPort”: 8088, “hostPort”: 8088 }


the rest of the code can be found in my GitHub repo

after you configure your DCOS CLI and log in, you can run this command :

Screen Shot 2018-09-03 at 8.01.37 PM

if we take a look at the UI we can see that app/web server has been deployed :

Screen Shot 2018-09-03 at 8.03.35 PM

Deploy machine learning models on AWS lambda and serverless

in the last post, we talked about how to deploy a Machine learning trained model on Kubernates.

Here is another way of deploying ML models: AWS lambda + API gateway

Untitled Diagram (3)

Basically, your model (mlpreg.pkl) will be stored in S3, your lambda function will download the model use it to make predictions, another call will allow you to get the model hyperparameters, and sent it back to the user.

Screen Shot 2018-08-07 at 9.11.00 AM

to deploy AWS services, we will use a framework called Serverless

serverless allow you with a single configuration file to define functions, create resources, declare permissions, configure endpoints …

serverless uses one main config file and one or multiple code files :

  • handler.py : the lambda function
  • serverless.yml : serverless configuration file

here is what the serverless configuration file for this example would look like :


service: deploy-ml-service
– serverless-python-requirements
name: aws
runtime: python2.7
– Effect: Allow
# Note: just for the demo, we are giving full access to s3
– s3:*
Resource: “*”
handler: handler.predict
– http:
path: predict
method: post
cors: true
integration: lambda
handler: handler.getModelInfo
– http:
path: params
method: post
cors: true
integration: lambda[/code]

as described in the example we will create two functions one will make a prediction using the model we built in the last post, the other one will display the model hyperparameters :

  • predict
  • getModelInfo

to load the model we have  :

  • load_model : loading the stored model from S3


from sklearn.externals import joblib
import boto3

BUCKET_NAME = ‘asset-s3-uploader-02141’

def predict(event,context):
input = event[“body”][“input”]
modelName = event[“body”][“model_name”]
data = float(input)
return loadModel(modelName).predict(data)[0]

def loadModel(model):
s3_client = boto3.client(‘s3’)
download_path = ‘/tmp/model.pkl’
s3_client.download_file(BUCKET_NAME, model, ‘/tmp/model.pkl’)
return joblib.load(download_path)

def getModelInfo(event,context):
model = event[“body”][“model_name”]
return loadModel(model).get_params()

$Serverless Deploy ! 

yep that’s all it takes, and your services will be deployed in seconds:

Screen Shot 2018-08-06 at 9.25.59 PM

Run the tests:

getting the model Hyperparameters :


root@58920085f9af:/tmp/deploy# curl -s -d “model_name=mlpreg.pkl” https://abcefgh123.execute-api.us-east-1.amazonaws.com/dev/params | python -m json.tool
“activation”: “relu”,
“alpha”: 0.001,
“batch_size”: “auto”,
“beta_1”: 0.9,
“beta_2”: 0.999,
“early_stopping”: false,
“epsilon”: 1e-08,
“hidden_layer_sizes”: [
“learning_rate”: “constant”,
“learning_rate_init”: 0.01,
“max_iter”: 1000,
“momentum”: 0.9,
“nesterovs_momentum”: true,
“power_t”: 0.5,
“random_state”: 9,
“shuffle”: true,
“solver”: “adam”,
“tol”: 0.0001,
“validation_fraction”: 0.1,
“verbose”: false,
“warm_start”: false


Making Predictions :


root@58920085f9af:/tmp/deploy# curl -s -d “input=1&model_name=mlpreg.pkl” https://abcdefg123.execute-api.us-east-1.amazonaws.com/dev/predict | python -m json.tool