ERCOT is an acronym for Electric Reliability Council of Texas, it manages the flow of electric power to more than 25 million Texas customers — representing about 90 percent of the state’s electric load. As the independent system operator for the region, ERCOT schedules power on an electric grid that connects more than 46,500 miles of transmission lines and 600+ generation units. It also performs financial settlement for the competitive wholesale bulk-power market and administers retail switching for 7 million premises in competitive choice areas.
ERCOT also offers an online and public dataset giving market participants information on a variety of topics related to the market of electricity in the state of Texas, which makes it a good candidate for AWS products: Glue and Athena.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run
Scraping Data :
the data on Ercot website is available as a collection of .zip files, I used a python scraper from this github repository to only collect CSV files.
As an example, we will be collecting data about the total energy sold from this page
Using the previous tools the command would something like this :
python -m ercot.scraper "http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12334&reportTitle=DAM%20Total%20Energy%20Sold&showHTMLView=&mimicKey"
the script will download the csv files and store them in a data folder :
At this point, we transfer the data to S3 to be ready for AWS Glue, an optimization of this process could consist of creating a lambda function with a schedule to continuously upload new datasets
Creating a Crawler
you can add a Crawler in AWS Glue to be able to traverse datasets in S3 and create a table to be queried.
At the end of its run, the crawler creates a table that contains records gathered from all the CSV files we downloaded from EROCT public dataset, in this instance the table is called: damtotqtyengysoldnp
And now you can query Ahead!
using AWS Athena, you can run different queries on the table we generated previously, here are some few examples :
Total energy sold by settlement point :
Getting the hours of the day: 11/12/2018 with the Max energy sold