Often when training machine learning models, you find yourself creating different estimators and tuning this parameter or that to get the results you want. You may also find yourself wanting to save the results of those iterations, to save you time in the future.
That's what I'm trying to address in this post: having some sort of artifact repository for machine learning models that also saves your parameters as metadata, using the following design:
1: The user uploads artifacts using pre-signed S3 URLs
2 and 3: A putObject event triggers the Lambda function to make an API call to an EC2 instance running an HTTP server, which reads the estimator from S3 and extracts its parameters (a minimal sketch of such a Lambda function follows the list)
4: The parameters are saved in DynamoDB
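To give an idea of what steps 2 and 3 could look like, here is a minimal sketch of a Lambda handler for the putObject event. The EC2 address and the /process route are placeholders for illustration, not the actual values used in this project:
[code]
import json
import urllib.request

# placeholder: address of the EC2 HTTP server described later in this post
EC2_ENDPOINT = 'http://your-ec2-host:8080/process'

def handler(event, context):
    # an S3 putObject event may contain several records
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # ask the EC2 server to download this asset and extract its parameters
        payload = json.dumps({'bucket': bucket, 'asset_id': key}).encode()
        call = urllib.request.Request(
            EC2_ENDPOINT,
            data=payload,
            headers={'Content-Type': 'application/json'})
        with urllib.request.urlopen(call) as response:
            print(response.read())

    return {'statusCode': 200}
[/code]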
Uploading artifacts:
I use AWS S3 to store the assets, making use of the pre-signed URL feature, which lets you use temporary URLs to upload files to S3 and takes away the hassle of managing permissions.
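As a quick illustration of the feature (not the exact code in the stack), a pre-signed upload URL can be generated with boto3 like this; the bucket and key names are placeholders:
[code]
import boto3

s3_client = boto3.client('s3')

# temporary URL that allows a PUT of this key for one hour
upload_url = s3_client.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-models-bucket', 'Key': 'my_model.pkl'},
    ExpiresIn=3600)

print(upload_url)
[/code]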
To orchestrate all of this I like to use my favorite Serverless Framework.
Here is the code on GitHub.
Deploying the Serverless stack:
[code]$ serverless deploy[/code]
This will create the following endpoints:
- POST /dev/asset
- GET /dev/asset
- PUT /dev/asset/{asset_id}
- DELETE /dev/asset/{asset_id}
These endpoints allow you to create, read, update, and delete an artifact, which in this case is a model.
For more reading about this, check out the README page of this serverless example.
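To illustrate how the other endpoints might be called (the base URL is the one from the test script further down, and the exact request and response shapes depend on the handlers in that serverless example):
[code]
import requests

# base URL of the deployed API (same one used in the test script below)
BASE_URL = 'https://oo0cl2av91.execute-api.us-east-1.amazonaws.com/dev'

# list existing assets
print(requests.get(BASE_URL + '/asset').json())

# update or delete a specific asset (placeholder asset_id)
asset_id = 'some-asset-id'
requests.put(BASE_URL + '/asset/' + asset_id)
requests.delete(BASE_URL + '/asset/' + asset_id)
[/code]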
Getting the parameters:
In this part, the EC2 instance downloads the model and extracts the parameters so they can be stored in DynamoDB.
Initially I thought I could do all of this work in Lambda, so I wouldn't have to create an EC2 instance just to read the parameters. Unfortunately, there are a couple of issues with that approach. One of them is the size of the dependencies: once you add the scikit-learn libraries as a dependency, the size of the Lambda zip reaches 60 MB. And even once uploaded, there was an issue running scikit-learn as part of the Lambda, so for this iteration I decided to use a t2.micro on EC2.
The EC2 instance runs a Python web server that receives requests with an asset_id, downloads the asset, extracts the parameters, and stores them in DynamoDB.
This is the code for the server:
https://github.com/mbenachour/store_ml_models/blob/master/server.py
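The actual implementation is at the link above; as a rough sketch of the idea (using Flask and boto3, with placeholder bucket and table names that may differ from the real server), it boils down to something like this:
[code]
import boto3
from flask import Flask, jsonify, request
from sklearn.externals import joblib

app = Flask(__name__)
s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('models')  # placeholder table name

BUCKET_NAME = 'my-models-bucket'  # placeholder bucket name

@app.route('/process', methods=['POST'])
def process():
    asset_id = request.get_json()['asset_id']

    # download the pickled estimator from S3 and load it
    local_path = '/tmp/model.pkl'
    s3.download_file(BUCKET_NAME, asset_id, local_path)
    model = joblib.load(local_path)

    # extract the estimator's parameters and store them as metadata
    params = {k: str(v) for k, v in model.get_params().items()}
    table.put_item(Item={'asset_id': asset_id, 'params': params})

    return jsonify(params)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
[/code]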
Testing the upload:
To test all of this, I created a small Python script:
[code]
import sys

import requests
from sklearn.externals import joblib


def upload(filename):
    model = loadModel(filename)
    print(model.get_params())

    # ask the API for a pre-signed upload URL
    url = 'https://oo0cl2av91.execute-api.us-east-1.amazonaws.com/dev/asset'
    response = requests.post(url)
    print(response)

    # upload the pickled model to S3 using the pre-signed URL
    presigned = response.json().get('body').get('upload_url')
    response = requests.put(presigned, data=open(filename, 'rb').read())
    print(response)


def loadModel(model_path):
    download_path = model_path
    # s3_client.download_file(BUCKET_NAME, model, '/tmp/model.pkl')
    return joblib.load(download_path)


upload(sys.argv[1])
[/code]
To run it, use:
[code] python test.py your_model.pkl [/code]
If you look at your DynamoDB table, you will see that your model now has a description:
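If you would rather check from code than from the console, a quick scan with boto3 does it too (the table name is a placeholder and has to match whatever the server writes to):
[code]
import boto3

# placeholder table name; use the table your server writes to
table = boto3.resource('dynamodb').Table('models')

for item in table.scan()['Items']:
    print(item['asset_id'], item.get('params'))
[/code]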