In this blog series so far, I have presented the concepts behind a music recommendation engine, a music recommendation model for TensorFlow, and a GCP architecture to make it accessible via the web. The end result has been an ML model wrapped in a stand-alone service that gives you predictions on demand.

Before diving further into implementing more complicated ML models, I thought it would first be worth looking into how we could deploy our TensorFlow model into AWS. After some investigation, I’ve concluded that the better option is to use Lambda functions. In this post, I’ll explain why that’s the case, how you can do it, and an interesting pain point you have to keep in mind.

Let’s break new ground!


Why Lambdas?

So let’s assume we have a TensorFlow model that takes in the IDs of the songs a user has listened to and returns recommended song IDs. Our task is to deploy this model on the AWS platform and start receiving recommended song IDs from it.


The first option that might spring to mind is to have an EC2 virtual instance that has access to our model and runs TensorFlow Serving on top of it.

The upsides of using an EC2 instance include:

  • We get to communicate with the TensorFlow model through the gRPC protocol
  • If we change our model, the serving of the changed model is reflected instantly, with zero downtime
  • We get to use GPUs for our TensorFlow model if needed

However, this option also has downsides:

  • We have to self-manage our EC2 server
  • We pay for the up-time of our EC2 instance, even when it is not being used
  • If the number of requests escalates, we have to manually manage our EC2 hardware configuration, or create a cluster of EC2 instances and serve the same model on each and every instance, with a load balancer in front to spread the requests among them

If the upsides outweigh the downsides for you, by all means, go with EC2 (and for more information on serving up your model with TensorFlow Serving, check out my previous blog). However, if it all seems like too much trouble, the second option is to use a Lambda function which calls the model directly.

Lambda functions have the advantage of auto-scaling, and as a result, you don’t spend your time manually handling your instances and network. Furthermore, they make it easier to guarantee that your service will always be up and ready to use. You also pay for the amount of usage, rather than for the servers you have dedicated to serving your model.

Lambda functions can be automatically triggered by changes in your data store, when real-time data streams into your system, or (in our case) via a RESTful web service gateway. The main disadvantage of Lambdas is that you cannot use GPUs, but this is not a big deal for our prediction model, which is pretty quick even without them.

How to use TensorFlow with Lambda


So now let’s dive into how we can serve our model with Lambda functions. Lambda functions can be written in a few different languages, of which we’ll pick Python.

Our goal is to implement a RESTful web service that receives listened song IDs, feeds them into our model, and returns to the client the recommended song IDs it gets back.

We’ll break this task into several steps. We’ll start by preparing the TensorFlow model and writing code so that we can feed inputs to it. We’ll then implement the Lambda function itself, which calls the model to receive recommendations. We’ll then package our deployment project, which includes some non-standard dependencies. Finally, we’ll deploy the package with the Serverless toolkit.

Model and input preparation

We’re using a matrix factorization model, the details of which will be the subject of later blogs. Furthermore, we can assume that the model has been trained and saved somewhere. This means that, for now, it is adequate to see the model as a black box with listened song IDs as input and recommended song IDs as output.

At prediction time, the inputs to the model should be in a format that the model can comprehend. In order to see the model’s input and output, the following TensorFlow command can be executed:

saved_model_cli show --dir $MODEL_DIR --all

For our model, this is what the command returns:

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs: 

 The given SavedModel SignatureDef contains the following input(s):
 inputs['examples'] tensor_info:
 dtype: DT_STRING
 shape: (-1)
 name: input_example_tensor:0
 The given SavedModel SignatureDef contains the following output(s):
 outputs['probabilities'] tensor_info:
 dtype: DT_FLOAT
 shape: (-1, 500)
 name: TopKV2:0
 outputs['top_k'] tensor_info:
 dtype: DT_INT64
 shape: (-1, 500)
 name: ToInt64:0
 Method name is: tensorflow/serving/predict

This shows that we should load the model with the ‘serve’ tag, and that its input tensor is named ‘input_example_tensor:0’ and holds strings. The model can return two outputs: ‘probabilities’ and ‘top_k’. We want top_k, which holds the recommended song IDs: for each example it is an integer array with 500 elements, corresponding to the top 500 recommended song IDs for the user. The name ‘ToInt64:0’ is how we can access it in the code.

The reason for input_example_tensor being of type string is that we have used the TensorFlow Transform library for the data preparation. The tensor parameter is a serialized string made from a list of dictionaries. Each element in the list corresponds to an example that we send to the model to ask for recommendations. The dictionary has only one key for each request: ‘query_listened_song_ids’ (the name used in the code below), which is a list of integers.

To convert our listened song IDs list into the input of the model, we can use the following Python code:

import tensorflow as tf
from tensorflow_transform import coders as tft_coders
from tensorflow_transform.beam import impl as tft
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import dataset_metadata 

QUERY_LISTENED_SONG_IDS = 'query_listened_song_ids'

def prepare_input(listened_songs):
    feature = construct_feature(listened_songs)
    feature = tensorflow_encode(feature)
    return feature

def construct_feature(listened_songs):
    feature = {}
    feature[QUERY_LISTENED_SONG_IDS] = listened_songs
    return feature

def tensorflow_encode(features):
    features = [features]
    encoded_str = ''
    with tft.Context(temp_dir='/tmp'):
        raw_metadata = dataset_metadata.DatasetMetadata(schema=make_prediction_schema())
        transform_fn = ((features, raw_metadata) | 'Analyze' >> tft.AnalyzeDataset(preprocessing_fn))
        (dataset, metadata) = (((features, raw_metadata), transform_fn) | 'Transform' >> tft.TransformDataset())
        coder = tft_coders.ExampleProtoCoder(metadata.schema)
        encoded_str = coder.encode(dataset[0])
    return encoded_str

def preprocessing_fn(inputs):
    # Identity transform: the features are passed through unchanged
    return inputs

def make_prediction_schema():
    prediction_columns = [QUERY_LISTENED_SONG_IDS]
    # A list-wrapped type marks a variable-length feature
    return make_schema(prediction_columns, [[tf.int64]], [None])

def make_schema(columns, types, default_values):
    result = {}
    assert len(columns) == len(types)
    assert len(columns) == len(default_values)
    for c, t, v in zip(columns, types, default_values):
        if isinstance(t, list):
            result[c] = tf.VarLenFeature(dtype=t[0])
        else:
            result[c] = tf.FixedLenFeature(shape=[], dtype=t, default_value=v)
    return dataset_schema.from_feature_spec(result)

Do not worry if you have not encountered the TensorFlow Transform library and do not fully follow every single line of code here. The code above simply prepares the data in the format that the model expects as input. Note that the library requires access to a temp directory in order to perform some internal data manipulation. We have nominated /tmp, as we know that Lambda functions will have access to that.

Implementing the Lambda function

The Lambda function loads the model from a directory, queries it by sending the listened song IDs, and finally returns a JSON string. Sounds easy enough:


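import json
import os
import sys

# Make the bundled libraries (TensorFlow, TensorFlow Transform, ...) importable
cur_dir = os.path.dirname(os.path.realpath(__file__))
sys.path.append(os.path.join(cur_dir, 'modules'))

import boto3
import tensorflow as tf

# prepare_input() is the function defined in the previous section

BUCKET_NAME = 'BUCKET_NAME'  # placeholder: replace with your S3 bucket name
MODEL_DIR = '/tmp/tf_model'
# Files that make up the SavedModel. The data shard name is TensorFlow's usual
# single-shard name and is an assumption here; adjust it to match your export.
# saved_model.pb comes last so that its presence marks a complete download.
MODEL_FILES = ['variables/variables.index',
               'variables/variables.data-00000-of-00001',
               'saved_model.pb']

def recommend(event, context):
    try:
        download_model_to_tmp()
        song_ids = get_song_ids_from_url(event)
        k = get_no_requested_recoms_from_url(event)
        with tf.Session() as sess:
            tf.saved_model.loader.load(sess, ['serve'], MODEL_DIR)
            graph = tf.get_default_graph()
            top_k = graph.get_tensor_by_name("ToInt64:0")
            # probabilities = graph.get_tensor_by_name("TopKV2:0")
            example = prepare_input(song_ids)
            top_k_recoms = sess.run(top_k, feed_dict={'input_example_tensor:0': [example]})[0][:k]
            return return_lambda_gateway_response(200, {'recommended_song_ids': str(top_k_recoms)})
    except Exception as ex:
        error_response = {'error_message': "Unexpected error", 'stack_trace': str(ex)}
        return return_lambda_gateway_response(503, error_response)

def download_model_to_tmp():
    # /tmp survives between invocations of a warm container, so download only once
    if not os.path.exists(os.path.join(MODEL_DIR, 'variables')):
        os.makedirs(os.path.join(MODEL_DIR, 'variables'))
    if not os.path.isfile(os.path.join(MODEL_DIR, 'saved_model.pb')):
        s3 = boto3.client('s3')
        for name in MODEL_FILES:
            s3.download_file(BUCKET_NAME, 'tf_model/' + name, os.path.join(MODEL_DIR, name))

def return_lambda_gateway_response(code, body):
    return {"statusCode": code, "body": json.dumps(body)}

def get_query_params(event):
    # queryStringParameters is None when the URL carries no query string
    return event.get('queryStringParameters') or {}

def get_song_ids_from_url(event):
    songs_str = get_query_params(event)['songIds']
    return [int(x) for x in songs_str.split(',')]

def get_no_requested_recoms_from_url(event):
    params = get_query_params(event)
    if 'k' in params:
        return int(params['k'])
    return 10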

In the first few lines, we have added the directory where the non-standard libraries like TensorFlow and TensorFlow Transform are kept. This allows the Lambda function to access them.

Next comes the ‘recommend’ function. This is executed when the web service is called. The event parameter contains the URL parameters in its queryStringParameters. We will fetch the listened songIds and the (optional) number of recommended songs (k) first, then load and call the TensorFlow model. Finally, a JSON string is generated from the recommended songs and returned as the output.
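To make the shape of the event concrete, here is a minimal sketch of pulling the parameters out of a gateway event and wrapping a reply. The helper names parse_query and gateway_response and the sample event are illustrative assumptions, not part of the deployed handler; only the queryStringParameters key, the songIds and k parameters, and the default of 10 come from the text above.

```python
import json

def parse_query(event):
    """Extract the listened song IDs and k from a gateway event (hypothetical helper)."""
    # queryStringParameters may be missing or None when no query string is sent
    params = event.get("queryStringParameters") or {}
    song_ids = [int(x) for x in params.get("songIds", "").split(",") if x]
    k = int(params.get("k", 10))  # default of 10 recommendations, as in the handler
    return song_ids, k

def gateway_response(code, body):
    """Wrap a Python dict in the statusCode/body structure returned by the handler."""
    return {"statusCode": code, "body": json.dumps(body)}

# A hand-written sample event, standing in for what the gateway would pass in
sample_event = {"queryStringParameters": {"songIds": "12,345,6789", "k": "5"}}
song_ids, k = parse_query(sample_event)
print(song_ids, k)
print(gateway_response(200, {"echo": song_ids}))
```

Running the parsing logic against a hand-written event like this is a cheap way to check it before anything is deployed.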

Just a side note: at the top of the ‘recommend’ method, the download_model_to_tmp function is called. This checks whether the model already exists in the /tmp directory. If it doesn’t, it downloads the TensorFlow model from S3 storage into the /tmp directory.
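The download-once idea can be sketched independently of S3. The version below is an illustration under assumptions: ensure_model_files, the ".complete" marker file, and the injected download callable are all hypothetical names, chosen so the logic can be exercised locally without AWS.

```python
import os

def ensure_model_files(local_dir, keys, download):
    """Fetch every key into local_dir once; later calls are no-ops.

    `download(key, target_path)` is an injected callable (hypothetical), so the
    same logic works with S3 or with a local stub. Returns True if a download
    happened, False if the files were already in place.
    """
    marker = os.path.join(local_dir, ".complete")  # written only after all files arrive
    if os.path.isfile(marker):
        return False
    for key in keys:
        target = os.path.join(local_dir, key)
        parent = os.path.dirname(target)
        if not os.path.isdir(parent):
            os.makedirs(parent)
        download(key, target)
    open(marker, "w").close()
    return True
```

With boto3, the callable could be something like lambda key, target: s3.download_file(BUCKET_NAME, 'tf_model/' + key, target). Writing the marker last means an interrupted download is retried on the next invocation rather than leaving a half-populated model directory behind.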

Lots of coding, huh!? But stay with me. We’re paving the road to the AWS cloud, but this is not the end of the story. Next is deploying our service on AWS.


Preparing the deployment package

So we now have our Python implementation of the Lambda function. Our next step is to prepare the package we are sending to the cloud. This should include the dependencies we need.

Here comes the painful part: the maximum size of the (unzipped) deployment package is 250 MB. We’ve already avoided unnecessary package bloat by loading the model from an S3 bucket at runtime, rather than embedding it in the package. However, TensorFlow itself (and all of its dependencies) still blows the package size out to about 350 MB.


Minimal dependency Packaging


To achieve a small package, we need to minimise the size of the libraries. TensorFlow is huge, and we don’t need all its parts. Because of the size limitation, packaging a Lambda function with a TensorFlow dependency resembles packing all of your furniture into a Mini-Minor. We’re going to have to get rid of everything unnecessary!

The first thing that comes to mind is to remove dependencies that are not used at the time of prediction. This means deleting some heavy directories like tensorboard, external, GCP based ones, etc.

However, this is still not sufficient. We need to go further and remove the ‘contrib’ packages from TensorFlow. These contain very useful computational modules, but we are not using them at prediction time. So let’s throw them away and comment out the lines where they get imported.

Another large-sized directory is the ‘include’ directory in TensorFlow. Surprisingly, getting rid of this directory does not hurt either.

At the end of the day, it’s possible to get the package down to less than 250 MB and still have the model generate predictions. Well done!
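When trimming, it helps to know which directories are actually heavy. The following is a small sketch that uses only the standard library; the function names and the "modules" path in the usage lines are assumptions for illustration, and the limit constant is the 250 MB figure mentioned above.

```python
import os

LAMBDA_LIMIT_BYTES = 250 * 1024 * 1024  # the 250 MB package limit discussed above

def dir_size_bytes(path):
    """Total size of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def largest_subdirs(path, top=10):
    """Immediate subdirectories of path, largest first, as (name, bytes) pairs."""
    sizes = []
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            sizes.append((entry, dir_size_bytes(full)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]

def fits_in_lambda(path):
    """True if everything under path is within the package limit."""
    return dir_size_bytes(path) <= LAMBDA_LIMIT_BYTES

# Usage: report the heaviest dependency directories, if the folder exists
if os.path.isdir("modules"):
    for name, size in largest_subdirs("modules"):
        print("%-30s %8.1f MB" % (name, size / (1024.0 * 1024.0)))
    print("fits:", fits_in_lambda("modules"))
```

Re-running a report like this after each removal shows how much each deleted directory actually saved.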


Deployment with Serverless


Serverless is a CLI toolkit that can be used to deploy serverless projects on GCP and AWS. We’re going to use it to deploy our AWS Lambda function.

A YAML file is used to configure deployment attributes like the AWS region, permissions, the Python method to be called by the Lambda triggers, and the triggering events. In our case, we configure the Lambda function to respond to RESTful GET requests. The serverless.yml looks like this:

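frameworkVersion: "1.2.1"

service: matfact-recommender

package:
  exclude:
    - node_modules/**
    - .ephemeral/**
    - package.json
    - package-lock.json

provider:
  name: aws
  runtime: python2.7
  stage: dev
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:GetObject
      # s3:GetObject applies to objects, hence the /* suffix
      Resource: "arn:aws:s3:::BUCKET_NAME/*"

functions:
  # The function key was lost from the original listing; 'recommend' is assumed
  recommend:
    handler: handler.recommend
    events:
      - http:
          path: /predict
          method: get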

The Lambda function being invoked is the ‘recommend’ method, referenced in the configuration as handler.recommend. The permission required to load the TensorFlow model from our S3 bucket is also defined here. Finally, in the root directory, we have a modules directory which keeps all the library dependencies.

The deployment kicks off by running the ‘serverless’ command (which assumes you have also installed and configured awscli):

 serverless deploy 

This will zip your directory, send it to S3, unzip it, and then deploy the Lambda function. You can then call the Lambda function’s URL, passing the songIds and the optional k parameter, and get the JSON response. A sample response is shown below.

"recommended_song_ids": [ 1569 1097 9456 9206 11641] 
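On the client side, calling the service comes down to building the query string and unpacking the reply. The sketch below (Python 3) makes a few assumptions: the base URL is a made-up placeholder, the helper names are hypothetical, and the body parsing assumes the IDs arrive as a bracketed, space-separated string, which is what str() of the result array in the handler produces.

```python
import json
from urllib.parse import urlencode

def build_predict_url(base_url, song_ids, k=None):
    """Build the GET URL for the /predict path with songIds and optional k."""
    params = {"songIds": ",".join(str(s) for s in song_ids)}
    if k is not None:
        params["k"] = k
    # Keep commas readable instead of percent-encoding them
    return base_url.rstrip("/") + "/predict?" + urlencode(params, safe=",")

def parse_recommendations(body):
    """Turn a response body into a list of ints.

    Assumes the body is JSON whose recommended_song_ids value is a string
    such as "[ 1569 1097 9456]".
    """
    raw = json.loads(body)["recommended_song_ids"]
    return [int(token) for token in raw.strip("[]").split()]

# Placeholder endpoint: substitute the URL printed by the deploy step
url = build_predict_url("https://example.invalid/dev", [1, 2, 3], k=5)
print(url)
print(parse_recommendations('{"recommended_song_ids": "[ 1569 1097 9456 9206 11641]"}'))
```

The URL can then be fetched with any HTTP client. If the handler were changed to return a proper JSON list, the string parsing in parse_recommendations would no longer be needed.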


Congratulations, your serverless service is up and running!

Comparison to GCP

In this post I started by looking at options for deploying TensorFlow models to AWS. I explained how, if you host TensorFlow Serving yourself, you’ll need to configure instances manually and will also be in charge of scaling. You also need to watch the costs of having dedicated servers carefully. If you take a similar approach with GCP Compute Engine, you’ll have the same issues.

We saw that this makes serverless and auto-scalable solutions very appealing, and I covered in detail how you’d set up such a solution using AWS Lambda functions. However, whilst you could probably emulate a similar approach in the Google Cloud using Google Cloud Functions, if you’re going with Google you might as well just use Cloud ML Engine, which lets you copy a TensorFlow model to Cloud Storage and run it directly. Furthermore, Cloud ML Engine uses powerful GPUs and a computational cluster specifically configured to be used for ML projects.

In short, Google’s offerings are more advanced, but if you absolutely need to run in Amazon’s cloud, it’s certainly possible. It’s a testament to TensorFlow’s portability and flexibility, and kudos should go to Google for building TensorFlow this way.

What to expect next


In the journey of our blogs so far, we’ve gotten acquainted with the concepts of recommender engines, developed a simple content-based similarity model implementation, and seen how it can be used for prediction, using either TensorFlow Serving or serverless infrastructure.

But whilst this opens the door to the machine learning world to developers, we haven’t really focussed on the models and how you train them. In upcoming blogs, we will shift our attention to the mathematics and models of machine learning, and training rather than prediction. This is where ML really gets interesting. Don’t touch that remote!

