In parts 1 and 2 of this blog series, we saw how to implement an item-similarity model in TensorFlow and covered the intuition behind various recommender models. It’s now time for a high-level view of a recommendation project on the Google Cloud Platform (GCP). This encompasses all the plumbing needed to make the web service available on the web. I will outline two possible architectures: one where we deploy and manage TensorFlow ourselves using the Google Kubernetes Engine (GKE), and the other using the fully managed Cloud Machine Learning Engine (MLE). You’ll also find out how to communicate with the ML engine modules and how to configure your computational clusters.

System architecture

By now, we have implemented a model that allows us to get recommendations for a user based on the songs they have listened to. Regardless of the details of the model, it is basically a TensorFlow model which is going to be hosted in GCP. Our approach will be to have a plug-and-play setting, meaning we will design our end service in a way that the recommender engine can be easily replaced.

In order to represent this flexibility in our architecture, we make use of two different cloud architectures which will differ in the ML components. The first architecture will contain a GKE cluster. This allows the highest flexibility but requires more manual work and self-maintenance. In contrast, the second architecture – which is the suggested architecture for an ML project in GCP – uses MLE to achieve auto-scalability with less configuration.

GKE-based architecture

There are different modules and technologies used to make the music recommendation service available. In the diagram below you can see the modules and the technologies used in each layer.

[Figure: architecture diagram of the GKE-based prediction service, showing the modules and technologies in each layer]

The end user communicates directly with the web service deployed on Google App Engine (GAE). The deployed web application allows the user to search for and select their favourite songs, add them to their playlist, or delete songs from their playlist. Whenever the playlist changes, an event is triggered that calls the ML service and passes the IDs of the songs the user likes (i.e., the songs in their playlist). The ML service then returns the recommended songs. GAE also interacts with Google Cloud SQL, which hosts a MySQL server, to fetch the song information for visualisation. This includes the name of the song, album, release year, artist, etc.

Visualization layer

For the web app in GAE, we have used the standard environment with Java and AngularJS. An illustration of the website is shown below.

[Figure: screenshot of the web application]

Data layer

We use Cloud SQL with a MySQL server for the data layer. The role of the DBMS is to let the user search for songs to select and to keep the song information that is useful for visualization. The Million Song Dataset (MSD) contains many attributes for each of its one million songs, stored as one HDF5 file per song, but not all of this information is useful, at least for visualization. We have processed the data and converted the attributes that are interesting for visualization into the relational database. The database keeps artists, albums, and songs as entities, plus the song genres.
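As a rough illustration of the relational layout described above, the sketch below builds a tiny schema with artists, albums, songs and genres and runs a title search. It uses Python's built-in sqlite3 module as a stand-in for Cloud SQL (MySQL), and every table name, column name and sample row is hypothetical rather than taken from our actual database.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE albums  (album_id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                      artist_id INTEGER REFERENCES artists(artist_id));
CREATE TABLE songs   (song_id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                      release_year INTEGER,
                      album_id INTEGER REFERENCES albums(album_id));
CREATE TABLE song_genres (song_id INTEGER REFERENCES songs(song_id),
                          genre TEXT NOT NULL);
"""


def search_songs(conn, text, limit=10):
    """Return (song_id, title, album, artist, year) rows whose title contains text."""
    query = """
        SELECT s.song_id, s.title, al.title, ar.name, s.release_year
        FROM songs s
        JOIN albums al ON al.album_id = s.album_id
        JOIN artists ar ON ar.artist_id = al.artist_id
        WHERE s.title LIKE ?
        ORDER BY s.title
        LIMIT ?
    """
    return conn.execute(query, ("%" + text + "%", limit)).fetchall()


# In-memory database with made-up rows, only to exercise the query.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO artists VALUES (1, 'Example Artist')")
conn.execute("INSERT INTO albums VALUES (1, 'Example Album', 1)")
conn.executemany("INSERT INTO songs VALUES (?, ?, ?, ?)",
                 [(5, 'First Song', 2001, 1), (15, 'Second Song', 2003, 1)])
print(search_songs(conn, "First"))
```

The same join would let the web application display the album, artist and release year next to each search result.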

We are dealing with one million songs, which can easily be queried in an SQL database with low latency. If the data were much larger, the DBMS would not be able to provide high-performance services on top of it. In that case, the best option would be to use BigQuery. But for our small to medium-sized dataset, Cloud SQL is the better option.

Google Cloud Storage (GCS) is a storage service that can hold any unstructured file. For the content-based recommendation, we keep a CSV file containing the features of the songs in GCS. When the recommender service is deployed on GKE and starts up, it reads this CSV file and keeps the features in memory for fast computation on request. As you saw in the previous blog, the similarity-based recommender engine computes a similarity criterion between the listened songs and the other songs based on their features, and returns the songs it finds most similar to the ones the user has listened to as the recommendations.
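To make the in-memory idea concrete, here is a minimal pure-Python sketch, not our TensorFlow model. It assumes a CSV layout of song_id followed by numeric features, averages the features of the listened songs into one profile, and ranks the remaining songs by cosine similarity. The CSV layout, the averaging step and the choice of cosine similarity are all assumptions made for illustration; the model from the previous blog may use a different criterion.

```python
import csv
import io
import math


def load_features(csv_text):
    """Parse 'song_id,f1,f2,...' rows into a dict of song_id -> feature list."""
    features = {}
    for row in csv.reader(io.StringIO(csv_text)):
        features[int(row[0])] = [float(x) for x in row[1:]]
    return features


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(features, listened_ids, k):
    """Return the ids of the k unheard songs most similar to the listened ones."""
    listened = [features[i] for i in listened_ids]
    dims = len(listened[0])
    # Average the listened songs into a single profile vector.
    profile = [sum(vec[d] for vec in listened) / len(listened) for d in range(dims)]
    candidates = [(cosine(profile, vec), sid)
                  for sid, vec in features.items() if sid not in listened_ids]
    # Highest similarity first; ties broken by the smaller song id.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sid for _, sid in candidates[:k]]


# Made-up features standing in for the CSV file kept in GCS.
SAMPLE_CSV = "1,1.0,0.0\n2,0.9,0.1\n3,0.0,1.0\n4,0.1,0.9\n"
features = load_features(SAMPLE_CSV)
print(recommend(features, [1], 2))  # prints [2, 4]
```

With one million songs and a handful of features each, a table like this fits comfortably in the memory of a single machine, which is why the service can answer requests without touching storage.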

ML service

This is where the two architectures diverge. The first version of the ML service, which is also used for the item-similarity recommender, encompasses two components. The first is a Google Compute Engine (GCE) instance which hosts a RESTful web service built with Flask. The service receives the song ids and the number of songs to recommend as request parameters, and then calls the second component, the Google Kubernetes Engine (GKE) cluster, to do all the computation and music recommendation. GCE then fetches the information of the recommended songs from Cloud SQL and returns a JSON response with the recommended song ids and their information.

An example request to the ML service would be of the form: <ws-url>?songs=5|15|1249&k=3, which indicates that the user has listened to songs number 5, 15 and 1249, and asks for 3 recommended songs (the k parameter). The response is as follows:

[Figure: example JSON response returned by the ML service]

We will fix the ML web service parameters and the JSON schema. As long as these inputs and outputs stay standardised, we can substitute the ML service with a more complicated machine learning algorithm, whatever components it is built from, without requiring any change in the other components for the whole service to work.
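The plug-and-play idea can be sketched in a few lines of Python. The query format (songs separated by | and a k parameter) comes from the example request above, while the function names, the dummy backend and the response field name are hypothetical, since the real JSON schema also carries the song details.

```python
import json
from urllib.parse import parse_qs


def parse_recommendation_query(query_string):
    """Parse 'songs=5|15|1249&k=3' into ([5, 15, 1249], 3)."""
    params = parse_qs(query_string)
    song_ids = [int(s) for s in params["songs"][0].split("|")]
    k = int(params["k"][0])
    return song_ids, k


def handle_request(query_string, recommender):
    """Run any recommender callable behind the same fixed request/response contract."""
    song_ids, k = parse_recommendation_query(query_string)
    recommended = list(recommender(song_ids, k))[:k]
    # Hypothetical response field name, used only for this sketch.
    return json.dumps({"recommended_song_ids": recommended})


def dummy_recommender(song_ids, k):
    """Placeholder backend: returns the k ids following the largest listened id."""
    start = max(song_ids) + 1
    return range(start, start + k)


print(handle_request("songs=5|15|1249&k=3", dummy_recommender))
# prints {"recommended_song_ids": [1250, 1251, 1252]}
```

Swapping the recommender engine then means passing a different callable to handle_request, while the caller on the GAE side keeps sending the same request and reading the same response.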

In this architecture, we use a GKE cluster that hosts TensorFlow Serving. TensorFlow Serving is used to serve a TensorFlow model and is designed to update the hosted model automatically with zero downtime. By using the served TensorFlow model, GKE does its magic and returns the recommended song ids.
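The zero-downtime update relies on how TensorFlow Serving watches the model base path: each exported model lives in a numbered version sub-directory, and by default the server loads the highest version it finds. The sketch below only mimics that selection rule in plain Python, so you can check an export directory before deploying it. The helper name is ours and is not part of TensorFlow Serving.

```python
import os
import tempfile


def latest_model_version(model_base_path):
    """Return the highest numeric version sub-directory, or None if there is none."""
    versions = [int(name) for name in os.listdir(model_base_path)
                if name.isdigit() and os.path.isdir(os.path.join(model_base_path, name))]
    return max(versions) if versions else None


# Demo with a throwaway directory standing in for the model base path.
with tempfile.TemporaryDirectory() as base:
    for version in ("1", "2", "10"):
        os.mkdir(os.path.join(base, version))
    print(latest_model_version(base))  # prints 10
```

Note that the versions are compared as numbers, so version 10 is newer than version 2 even though it sorts earlier as a string.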


Note that we are working with merely one million songs, which makes it possible to keep the features of all the songs in memory. If this were not the case, it would be necessary to adopt other schemes and components to do the computation. This is where the world of distributed processing and Hadoop might come into play (have a look at this for a good example).

So let’s summarize: the user interacts with the web project deployed on GAE to choose songs and ask for recommendations. GAE communicates with the ML service, which consists of a web service in GCE that calls the TensorFlow model to receive the recommended songs and fetches the attributes of those songs from Cloud SQL. A GKE cluster hosts the recommender engine. The recommended song ids are returned from GKE to GCE, which prepares a JSON response with the recommended songs. The response is then sent from GCE to GAE, and finally the recommendations are presented to the user.

Now that you know how the project is wired, it is worth mentioning a few details that might be challenging to someone who is implementing such an architecture. Firstly, you might be interested to know how we can send a query to TensorFlow Serving. Secondly, it might be useful to talk about how to make the GKE cluster with TensorFlow Serving. These two are investigated below before going to the next architecture.

Communicating with TensorFlow Serving

We have our RESTful web service project deployed on GCE. The web service receives as input the listened song ids and the number of songs to recommend. The service then calls TensorFlow Serving and asks for the recommended song ids, fetches the recommended songs information from the Cloud SQL and then responds to the request and passes the JSON data containing the information of the songs.

Such processes are typically familiar to Flask developers. The only step that you might not have encountered before is how to communicate with TensorFlow Serving through gRPC. TensorFlow Serving runs a gRPC service, so any language with gRPC support can communicate with it from inside the cluster network. Conveniently, a Python client API is provided on top of the gRPC API to communicate with TensorFlow Serving, which makes the life of developers much easier. Here you go:

from grpc.beta import implementations
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2


def prep_tensorflow_serving_client(host, port=9000, model_name='recommender', signature_name='recommend_songs'):
    # open an insecure gRPC channel to TensorFlow Serving and create the prediction stub
    channel = implementations.insecure_channel(host, port)
    stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
    # the request carries the model name and the signature that should be called
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    return (stub, request)


def call_tensorflow_serving_recommender(stub, request, listened_songs_ids, no_recom_songs):
    # fill the two input tensors expected by the signature
    request.inputs['listened_songs'].CopyFrom(tf.contrib.util.make_tensor_proto(listened_songs_ids))
    request.inputs['no_recom_songs'].CopyFrom(tf.contrib.util.make_tensor_proto(no_recom_songs))

    # call the served model with a timeout of 10 seconds
    result = stub.Predict(request, 10.0)
    returned_res = result.outputs['recom_songs_ids']

    # converting to list
    recom_songs = list(returned_res.int_val)
    return recom_songs

You can see that we prepare the TensorFlow Serving client by calling the prep_tensorflow_serving_client function and passing the IP of our GKE cluster. This returns the stub and request variables, which are then passed to the call_tensorflow_serving_recommender function along with the parameters of the request, namely listened_songs_ids, which is a list, and no_recom_songs, which is the number of songs we’d like to receive from the TensorFlow model. The function then receives the result, converts it into a list of integer values (the ids of the recommended songs) and returns the list.

With these two functions implemented, it is now possible to communicate with TensorFlow Serving and get the recommended songs. The rest of the web service fetches the song information from the database, creates the JSON response and responds to the requests accordingly.

TensorFlow Serving and GKE

After implementing a recommender engine and saving the model in TensorFlow, the next step is to prepare a cluster on GKE and have the cluster run TensorFlow Serving as a service. The process of creating a cluster with Kubernetes where TensorFlow Serving serves the Inception model has been well documented here. The process is almost the same for any ML model. First, a Docker image with TensorFlow Serving is prepared, and then it is deployed in GKE.

We can prepare a Docker image using the Dockerfile provided in the TensorFlow Serving project. This basically prepares a container running Ubuntu with the dependencies required to run TensorFlow Serving.

You can build a local Docker image by running the commands:


#This will create a local image with the requirements for TensorFlow:
docker build --pull -t $USER/tensorflow-serving-devel -f Dockerfile.devel .

#This will start a container named recommender_container from the image built above and open a shell in it:
docker run --name=recommender_container -it $USER/tensorflow-serving-devel

#installing tensorflow model server on the container
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -
apt-get update
apt-get install tensorflow-model-server

Now in another shell that runs outside the container, execute the following command to copy your saved TensorFlow model into the container:


docker cp $SAVED_TF_MODEL_DIR recommender_container:/

where $SAVED_TF_MODEL_DIR is where you have saved your TensorFlow model. Now, still outside the container, run the commands below to commit the container as an image and stop it:


#commit image for deployment

docker commit recommender_container $USER/recommender_container 

docker stop recommender_container 

If you like, you can test the Docker image locally to see if it works before uploading it to GKE:

docker run -it $USER/recommender_container

tensorflow_model_server --port=9000 --model_name=recommender --model_base_path=/$MODEL_DIR_NAME 

Now it’s time to deploy the Docker image in the cloud. However, just before that, make sure your cloud SDK is configured to connect to your GCP project ($PRJ_NAME):


#upload the docker image:

docker tag $USER/recommender_container gcr.io/ml-recommendation-sample/recommender

gcloud docker -- push gcr.io/ml-recommendation-sample/recommender

#this will take a while since you are uploading your Docker image to the Google Container Registry

#now creating service 

kubectl create -f $YAML_FILE_ADDRESS

# Check the deployment and the service (the number of replicas is set in the YAML file):

kubectl get deployments

kubectl get services

#This will show the ip of the service. Use external IP to check if it is working 

The $YAML_FILE_ADDRESS is the address to the yaml file. For our service, this file is shown below:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: recommender-deployment-internal
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: recommender-server-internal
    spec:
      containers:
      - name: internal-recommender-container
        image: gcr.io/ml-recommendation-sample/recommender
        command:
        - /bin/sh
        - -c
        # replace /$MODEL_DIR_NAME with the model directory copied into the image
        args:
        - tensorflow_model_server
          --port=9000 --model_name=recommender --model_base_path=/$MODEL_DIR_NAME
        ports:
        - containerPort: 9000
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: recommender-service-internal
  name: recommender-service-internal
  annotations:
    # the internal load balancer annotation is read from the Service
    cloud.google.com/load-balancer-type: "internal"
spec:
  ports:
  - port: 9000
    targetPort: 9000
  selector:
    app: recommender-server-internal
  type: LoadBalancer

Running the commands above will deploy TensorFlow Serving on the GKE cluster. Note that the process is very similar to that described in Google’s documentation for serving the Inception model in GKE. However, we have defined the load balancer to be an internal one, since we only want to reach the GKE service from inside our GCP network, through our RESTful web service. Our TensorFlow Serving service in GKE has access to the saved model and serves it on port 9000 with the model name ‘recommender’ and the signature name ‘recommend_songs’. The protocol to communicate with TensorFlow Serving is gRPC.
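Before wiring the web service to the cluster, it can help to confirm that the service IP actually accepts connections on port 9000. The following sketch uses only Python's standard socket module; the helper name is hypothetical, and it checks plain TCP reachability only, so it says nothing about whether the gRPC service behind the port is healthy. The demo runs against a local listener rather than a real cluster.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demo: a local listener stands in for the TensorFlow Serving port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
print(is_port_open("127.0.0.1", demo_port))
listener.close()
```

Against the real deployment you would call is_port_open with the IP reported by kubectl get services and port 9000, from a machine inside the same network, since the load balancer is internal.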

MLE-based architecture

For models that will be described in later blogs, the components for the ML service will change, but the rest of the architecture will remain fixed. The big difference between the two architectures is that we make use of the Machine Learning Engine (MLE) to serve our TensorFlow ML model. In this architecture, the TensorFlow model is deployed on MLE. MLE is the suggested component to consider when you have a TensorFlow machine learning model in GCP. Typically, TensorFlow models can be both trained and used for prediction using the ML Engine in GCP. It is pretty easy to deploy, train and query your TensorFlow model in MLE.

 

[Figure: architecture diagram of the MLE-based prediction service]

Also note that, in this version, we have used GAE as a microservice to implement the ML RESTful web service in the flexible environment, as opposed to using a GCE virtual machine for it. This is a more cost-effective option.

Another question you might have is how to query (get predictions from) a TensorFlow model once it is deployed in MLE. This is what you will find next.

Communicating with Google ML Engine

After introducing your model to MLE, a RESTful web service is automatically provided to ease communication. Apart from calling the web service directly, you can use the client APIs provided in different languages to communicate with it. Here, we show how you can query your recommender model.

If your model is called from within GCP, the caller must be authorized to call MLE. It is also possible to call your model from your local computer; for this, you need a credentials file that allows you to call a model in MLE.

import json
import os
import googleapiclient.discovery

credential_address = os.environ['CREDENTIALS_FILE']
project = os.environ['GCP_PRJ_NAME']
model = os.environ['ML_MODEL_NAME']
version = os.environ.get('ML_VERSION')  # can be None
# assuming a JSON string with keys being the names of the tensors the model expects as input, and the values being datatypes convertible to tensors or (possibly nested) lists of such datatypes
user_input = json.loads(os.environ['user_query'])
# the API expects a list of instances, so a single instance is wrapped in a list
instances = user_input if isinstance(user_input, list) else [user_input]

# introducing the credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credential_address
# in order to authenticate we set the environment variable GOOGLE_APPLICATION_CREDENTIALS, but if it is an App Engine or Compute Engine project, we can call:
# credentials = GoogleCredentials.get_application_default()

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(project, model)

if version is not None:
    name += '/versions/{}'.format(version)

response = service.projects().predict(name=name, body={'instances': instances}).execute()

if 'error' in response:
    raise RuntimeError(response['error'])

print(response['predictions'])
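For the recommender, the request body could be assembled as in the sketch below. The tensor names listened_songs and no_recom_songs are the ones used with TensorFlow Serving earlier; that the model deployed on MLE exposes the same input names is an assumption, and both helper functions are hypothetical. The layout of each returned prediction depends on the output signature of your model, so the second helper simply hands the list back.

```python
import json


def build_predict_body(listened_songs_ids, no_recom_songs):
    """Build an online prediction body containing one instance for the recommender."""
    instance = {
        # Input names assumed to match the serving signature of the model.
        "listened_songs": [int(i) for i in listened_songs_ids],
        "no_recom_songs": int(no_recom_songs),
    }
    return {"instances": [instance]}


def extract_predictions(response):
    """Return the predictions list, or raise if the service reported an error."""
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["predictions"]


body = build_predict_body([5, 15, 1249], 3)
print(json.dumps(body, sort_keys=True))
# prints {"instances": [{"listened_songs": [5, 15, 1249], "no_recom_songs": 3}]}
```

The resulting dictionary can be passed as the body argument of the predict call, and the response checked with extract_predictions.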

Summary

In this blog, we talked about the two architectures we use throughout the journey of music recommendation. Both host the prediction service components on the cloud, with the main divergence being how the ML service itself is hosted.

We saw that perhaps the most flexible configuration for an ML service is to use GKE with TensorFlow Serving. This allows manual cluster and hardware configuration, places no limit on the TensorFlow model size, and results in zero serving downtime, even when updating the model to a newer version. However, compared to the MLE option, preparing the cluster is more involved and leads to a more expensive service, especially if GPUs are requested in the cluster.

In contrast, MLE is the standard (and suggested) machine learning component in GCP. It automatically assigns a GPU computing cluster to the model, auto-scales the ML service if required, facilitates training and querying a TensorFlow ML model, and, finally, is cost-effective. Using this solution, an ML specialist can basically concentrate on the model, rather than having to consider other aspects like hardware configuration and availability.

In the next blog, we take a deep dive into the collaborative filtering recommender model. Our choice of collaborative filtering method is matrix factorisation. We’ll roll up our sleeves and use TensorFlow Transform and Google Cloud Dataflow to preprocess our training and validation datasets, implement a machine learning model by extending the estimators in TensorFlow, and finally train our ML model.

 
