How We Increased Tensorflow Serving Performance by 70%

Original author: Masroor Hasan
Tensorflow has become the standard platform for machine learning (ML), popular in both industry and research. Many free libraries, tools, and frameworks have been created for training and serving ML models. The Tensorflow Serving project helps serve ML models in a distributed production environment.

Our Mux service uses Tensorflow Serving in several parts of its infrastructure; we have already discussed using Tensorflow Serving to encode video titles. Today we will focus on techniques that reduce latency by optimizing both the prediction server and the client. Model predictions are usually “online” operations (on the critical path of an application request), so the main goal of optimization is to handle large volumes of requests with the lowest possible latency.

What is Tensorflow Serving?

Tensorflow Serving provides a flexible server architecture for deploying and serving ML models. Once a model is trained and ready to be used for prediction, Tensorflow Serving requires it to be exported to a compatible (servable) format.

A Servable is the central abstraction that wraps Tensorflow objects. For example, a model can be represented as one or more Servable objects. Servables are thus the basic objects that the client uses to perform computation. Servable size matters: smaller models take up less space, use less memory, and load faster. To be loaded and served through the Predict API, models must be in the SavedModel format.

Tensorflow Serving combines these basic components into a gRPC/HTTP server that serves several ML models (or several versions of a model), provides monitoring components, and has a configurable architecture.

Tensorflow Serving with Docker

Let's measure the baseline prediction latency with the standard Tensorflow Serving settings (without CPU optimization).

First, download the latest image from the TensorFlow Docker hub:

docker pull tensorflow/serving:latest  

In this article, all containers run on a host with four cores and 15 GB of RAM, running Ubuntu 16.04.

Export Tensorflow Model to SavedModel Format

When a model is trained with Tensorflow, the output can be saved as variable checkpoints (files on disk). Inference is then run either directly by restoring the model checkpoints or from a frozen graph (a binary file).

For Tensorflow Serving, this frozen graph needs to be exported to SavedModel format. The Tensorflow documentation contains examples of exporting trained models to the SavedModel format.

Tensorflow also provides many official and research models as a starting point for experimentation, research, or production.

As an example, we will use a deep residual network (ResNet) model to classify the ImageNet dataset of 1000 classes. Download the pre-trained ResNet-50 v2 model, specifically the channels_last (NHWC) variant in SavedModel format: as a rule, it works better on the CPU.

Copy the ResNet model directory into the following structure:

models/
  1/
    saved_model.pb
    variables/

Tensorflow Serving expects a numerically ordered directory structure for versioning. In our case, the directory 1/ corresponds to version 1 of the model and contains the model architecture, saved_model.pb, along with a snapshot of the model weights (variables).
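
The versioned layout can be sketched as a small helper. This is an illustration only: `install_saved_model` and `list_versions` are hypothetical names, and the code assumes that the exported model directory contains `saved_model.pb` at its top level.

```python
import os
import shutil


def install_saved_model(export_dir, base_dir, version=1):
    """Copy an exported SavedModel into base_dir/<version>/.

    Hypothetical helper: it only arranges files into the numbered
    layout described in the text; it does not validate the model.
    """
    if not os.path.isfile(os.path.join(export_dir, "saved_model.pb")):
        raise ValueError("no saved_model.pb in %s" % export_dir)
    target = os.path.join(base_dir, str(int(version)))
    if os.path.exists(target):
        raise ValueError("version directory already exists: %s" % target)
    # copytree creates the missing parent directories of the target.
    shutil.copytree(export_dir, target)
    return target


def list_versions(base_dir):
    """Return the numeric version directories under base_dir, sorted."""
    return sorted(
        int(name) for name in os.listdir(base_dir)
        if name.isdigit() and os.path.isdir(os.path.join(base_dir, name))
    )
```

Calling `install_saved_model("exported_resnet", "models", version=1)` (with a hypothetical source directory name) would produce `models/1/`, which is the directory that is later mounted into the container.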

Loading and Serving the SavedModel

The following command starts the Tensorflow Serving model server in a Docker container. To load SavedModel, you must mount the model directory in the expected container directory.

docker run -d -p 9000:8500 \  
  -v $(pwd)/models:/models/resnet -e MODEL_NAME=resnet \
  -t tensorflow/serving:latest

Checking the container logs shows that ModelServer is running and ready to serve inference requests for the resnet model at the gRPC and HTTP endpoints:

I tensorflow_serving/core/] Successfully loaded servable version {name: resnet version: 1}  
I tensorflow_serving/model_servers/] Running gRPC ModelServer at ...  
I tensorflow_serving/model_servers/] Exporting HTTP/REST API at:localhost:8501 ...  

Prediction Client

Tensorflow Serving defines its API schema in protocol buffers (protobufs) format. gRPC client implementations of the prediction API are packaged as the Python package tensorflow_serving.apis. We will also need the Python package tensorflow for utility functions.

Install the dependencies to create a simple client:

virtualenv .env && source .env/bin/activate && \  
  pip install numpy grpcio opencv-python tensorflow tensorflow-serving-api

The ResNet-50 v2 model expects floating-point tensors in the channels_last (NHWC) data layout. The input image is therefore read with opencv-python and loaded into a numpy array (height × width × channels) of type float32. The script below creates a prediction client stub, loads the JPEG data into a numpy array, and converts it to a tensor_proto to make a prediction request over gRPC:

#!/usr/bin/env python
from __future__ import print_function

import argparse
import numpy as np
import time
tt = time.time()  # start the clock before the heavy imports below

import cv2
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

parser = argparse.ArgumentParser(description='inception grpc client flags.')
parser.add_argument('--host', default='', help='inception serving host')
parser.add_argument('--port', default='9000', help='inception serving port')
parser.add_argument('--image', default='', help='path to JPEG image file')
FLAGS = parser.parse_args()

def main():
  # create prediction service client stub
  channel = implementations.insecure_channel(FLAGS.host, int(FLAGS.port))
  stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
  # create request
  request = predict_pb2.PredictRequest()
  request.model_spec.name = 'resnet'
  request.model_spec.signature_name = 'serving_default'
  # read image into numpy array
  img = cv2.imread(FLAGS.image).astype(np.float32)
  # convert to tensor proto and make request
  # shape is in NHWC (num_samples x height x width x channels) format
  tensor = tf.contrib.util.make_tensor_proto(img, shape=[1]+list(img.shape))
  # attach the tensor to the request; the input key 'input' is an assumption
  # and must match the input name in the model's serving signature
  request.inputs['input'].CopyFrom(tensor)
  resp = stub.Predict(request, 30.0)
  print('total time: {}s'.format(time.time() - tt))

if __name__ == '__main__':
  main()

Given a JPEG input, a working client produces the following result:

python --image=images/pupper.jpg  
total time: 2.56152906418s  

The response contains the prediction as an integer class ID together with the class probabilities.

outputs {  
  key: "classes"
  value {
    dtype: DT_INT64
    tensor_shape {
      dim {
        size: 1
    int64_val: 238
outputs {  
  key: "probabilities"
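
As a rough illustration of how the two outputs relate, the sketch below ranks a list of probabilities and returns the most likely class IDs. It assumes that the "classes" value is the index of the largest entry in "probabilities", which is typical for classification exports but is not confirmed by the truncated output above. The helper name `top_k` and the probability values are made up.

```python
def top_k(probabilities, k=3):
    """Return the k (class_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up probabilities for a 5-class toy problem (not model output).
probs = [0.05, 0.10, 0.60, 0.20, 0.05]
best_class, best_prob = top_k(probs, k=1)[0]
print(best_class, best_prob)
```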

For a single request, such latency is unacceptable. But it is not surprising: by default, the Tensorflow Serving binary is built to run on the widest range of hardware for most use cases. You have probably noticed the following line in the logs of the standard Tensorflow Serving container:

I external/org_tensorflow/tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA  

This indicates a TensorFlow Serving binary running on a CPU platform for which it has not been optimized.

Build an optimized binary

The Tensorflow documentation recommends compiling Tensorflow from source with all the optimizations available for the CPU of the host where the binary will run. During the build, special flags enable the CPU instruction sets of a specific platform:

Instruction set               Flags
AVX                           --copt=-mavx
AVX2                          --copt=-mavx2
FMA                           --copt=-mfma
SSE 4.1                       --copt=-msse4.1
SSE 4.2                       --copt=-msse4.2
All supported by processor    --copt=-march=native
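
To decide which of these flags apply to a given host, one option is to read the CPU feature flags that Linux reports in /proc/cpuinfo. The sketch below is an assumption-laden helper, not part of the Tensorflow build tooling: the function name `copts_from_cpuinfo` is hypothetical, and it relies on the Linux x86 flag names `avx`, `avx2`, `fma`, `sse4_1`, and `sse4_2`.

```python
import os

# Mapping from Linux /proc/cpuinfo flag names to the compiler options
# listed in the table above.
CPU_FLAG_TO_COPT = {
    "avx": "--copt=-mavx",
    "avx2": "--copt=-mavx2",
    "fma": "--copt=-mfma",
    "sse4_1": "--copt=-msse4.1",
    "sse4_2": "--copt=-msse4.2",
}


def copts_from_cpuinfo(cpuinfo_text):
    """Return the --copt flags whose instruction sets appear in cpuinfo text."""
    supported = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The line looks like "flags : fpu vme ... avx avx2 ...".
            supported.update(line.split(":", 1)[1].split())
    return [copt for flag, copt in CPU_FLAG_TO_COPT.items() if flag in supported]


# Only attempt to read the real file where it exists (Linux).
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(" ".join(copts_from_cpuinfo(f.read())))
```

The printed string can be compared with the value chosen for TF_SERVING_BUILD_OPTIONS before starting a long build.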

Clone a specific version of Tensorflow Serving. In our case, this is 1.13 (the latest at the time this article was published):

git clone --branch="$TF_SERVING_VERSION_GIT_BRANCH" 

The Tensorflow Serving development image uses the Bazel tool for the build. We configure it for specific CPU instruction sets:

TF_SERVING_BUILD_OPTIONS="--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.1 --copt=-msse4.2"

If memory is low, limit memory consumption during the build with the flag --local_resources=2048,.5,1.0. For information on the flags, see the Tensorflow Serving and Docker help, as well as the Bazel documentation.

Build the serving image on top of the development image:


git clone --branch="${TF_SERVING_VERSION_GIT_BRANCH}"
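TF_SERVING_BUILD_OPTIONS="--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.1 --copt=-msse4.2"
cd serving
# build the development image with the CPU-specific compiler options
docker build --pull -t $USER/tensorflow-serving-devel:$TAG \
  --build-arg TF_SERVING_VERSION_GIT_BRANCH="${TF_SERVING_VERSION_GIT_BRANCH}" \
  --build-arg TF_SERVING_BUILD_OPTIONS="${TF_SERVING_BUILD_OPTIONS}" \
  -f tensorflow_serving/tools/docker/Dockerfile.devel .
# build the serving image from the development image
docker build -t $USER/tensorflow-serving:$TAG \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel:$TAG \
  -f tensorflow_serving/tools/docker/Dockerfile .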

ModelServer is configured using TensorFlow flags to support concurrency. The following options configure two thread pools for parallel operation:


intra_op_parallelism_threads
  • controls the maximum number of threads for parallel execution of a single operation;
  • used to parallelize operations that have sub-operations that are independent in nature.

inter_op_parallelism_threads
  • controls the maximum number of threads for parallel execution of independent operations;
  • used for Tensorflow Graph operations that are independent of each other and can therefore run on different threads.

By default, both options are set to 0. This means that the system itself picks an appropriate number, which most often means one thread per core. However, the values can be changed manually to tune multi-core concurrency.
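
As a small sketch of how these two settings could be passed to the model server, the helper below builds the command-line flags `--tensorflow_intra_op_parallelism` and `--tensorflow_inter_op_parallelism`. The function name `parallelism_flags` is hypothetical, and using the core count for both pools is only one possible starting point, not a recommendation from the Tensorflow documentation.

```python
import os


def parallelism_flags(intra_op=0, inter_op=0):
    """Build ModelServer thread-pool flags.

    A value of 0 keeps the default, where the runtime picks the
    thread count itself.
    """
    for value in (intra_op, inter_op):
        if value < 0:
            raise ValueError("thread counts must be >= 0")
    return [
        "--tensorflow_intra_op_parallelism=%d" % intra_op,
        "--tensorflow_inter_op_parallelism=%d" % inter_op,
    ]


# Example: one thread per core for both pools on the current host.
cores = os.cpu_count() or 1
print(" ".join(parallelism_flags(intra_op=cores, inter_op=cores)))
```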

Then start the Serving container in the same way as the previous one, this time with a Docker image compiled from the sources and with Tensorflow optimization flags for a specific processor:

docker run -d -p 9000:8500 \  
  -v $(pwd)/models:/models/resnet -e MODEL_NAME=resnet \
  -t $USER/tensorflow-serving:$TAG \
    --tensorflow_intra_op_parallelism=4

Container logs should no longer show warnings about unused CPU instructions. Without any code changes, the latency of the same prediction request drops by about 35.8%:

python --image=images/pupper.jpg  
total time: 1.64234706879s  

Speeding Up the Prediction Client

Can we go any faster? We have optimized the server side for our CPU, but a latency of more than 1 second still seems too high.

It turns out that loading the tensorflow_serving and tensorflow libraries contributes significantly to the latency. Each unnecessary call to tf.contrib.util.make_tensor_proto also adds a fraction of a second.
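
One way to check how much a library import costs is to time it in a fresh interpreter. The sketch below is a generic measurement helper, not the method used for the numbers in this article; `time_import` is a hypothetical name, and the standard-library module `json` stands in for the heavy packages so that the example runs anywhere.

```python
import importlib
import sys
import time


def time_import(module_name):
    """Import module_name and return (seconds, was_already_loaded).

    A module already present in sys.modules comes from the cache, so
    the timing is only meaningful in a fresh interpreter.
    """
    already_loaded = module_name in sys.modules
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    return elapsed, already_loaded


# Replace "json" with "tensorflow" or "tensorflow_serving.apis"
# if those packages are installed.
for name in ["json"]:
    seconds, cached = time_import(name)
    print("%s: %.3fs%s" % (name, seconds, " (cached)" if cached else ""))
```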

You may ask: "Don't we need the TensorFlow Python packages to actually make prediction requests to a Tensorflow server?" In fact, there is no real need for the tensorflow_serving and tensorflow packages.

As noted earlier, the Tensorflow prediction APIs are defined as protobufs. The two external dependencies, tensorflow and tensorflow_serving, can therefore be replaced with the corresponding generated stubs, and then there is no need to pull the entire (heavy) Tensorflow library into the client.

To start, get rid of the tensorflow and tensorflow_serving dependencies and add the grpcio-tools package.

pip uninstall tensorflow tensorflow-serving-api && \  
  pip install grpcio-tools==1.0.0

Clone the tensorflow/tensorflow and tensorflow/serving repositories and copy the following protobuf files to the client project:


Copy these protobuf files to a protos/ directory with the original paths preserved:


For simplicity, prediction_service.proto can be reduced to implement only the Predict RPC, so that the nested dependencies of the other RPCs specified in the service do not have to be pulled in. Here is an example of a simplified prediction_service.proto.

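Generate the Python gRPC implementations with the protoc module from grpcio-tools:

# PROTOC_OUT is the output directory for the generated Python files;
# newer grpcio-tools releases expose the module as grpc_tools.protoc
PROTOS=$(find . | grep "\.proto$")
for p in $PROTOS; do
  python -m grpc.tools.protoc -I . --python_out=$PROTOC_OUT --grpc_python_out=$PROTOC_OUT $p
done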

Now the whole tensorflow_serving module can be removed:

from tensorflow_serving.apis import predict_pb2  
from tensorflow_serving.apis import prediction_service_pb2  

... and replace with the generated protobuffers from protos/tensorflow_serving/apis:

from protos.tensorflow_serving.apis import predict_pb2  
from protos.tensorflow_serving.apis import prediction_service_pb2

The tensorflow library is imported only for the helper function make_tensor_proto, which wraps a Python/numpy object as a TensorProto object.

Thus, we can replace the following dependency and code fragment:

import tensorflow as tf  
tensor = tf.contrib.util.make_tensor_proto(features)  

with protobuf imports and code that builds the TensorProto object directly:

from protos.tensorflow.core.framework import tensor_pb2  
from protos.tensorflow.core.framework import tensor_shape_pb2  
from protos.tensorflow.core.framework import types_pb2  
# ensure NHWC shape and build tensor proto
tensor_shape = [1]+list(img.shape)  
dims = [tensor_shape_pb2.TensorShapeProto.Dim(size=dim) for dim in tensor_shape]  
tensor_shape = tensor_shape_pb2.TensorShapeProto(dim=dims)  
tensor = tensor_pb2.TensorProto(
  dtype=types_pb2.DT_FLOAT,
  tensor_shape=tensor_shape,
  float_val=list(img.reshape(-1)))
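
To make the structure of such a message concrete, here is a dependency-free sketch of the two pieces of information a dense float tensor carries: its shape and its values flattened in row-major order. `to_shape_and_values` is a hypothetical helper that works on nested Python lists; it is not part of any Tensorflow API.

```python
def to_shape_and_values(image):
    """Flatten an H x W x C nested list into (NHWC shape, row-major floats)."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    values = [
        float(value)
        for row in image
        for pixel in row
        for value in pixel
    ]
    if len(values) != height * width * channels:
        raise ValueError("image rows or pixels have inconsistent lengths")
    # A leading batch dimension of 1 turns HWC into NHWC.
    return [1, height, width, channels], values


# A made-up 2 x 2 image with 3 channels.
image = [
    [[0, 1, 2], [3, 4, 5]],
    [[6, 7, 8], [9, 10, 11]],
]
shape, values = to_shape_and_values(image)
print(shape, len(values))
```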

The full Python script is here. Run the updated client, which sends a prediction request to the optimized Tensorflow Serving:

python --image=images/pupper.jpg  
total time: 0.58314920859s  

The following diagram shows the forecast execution time in the optimized version of Tensorflow Serving compared to the standard, over 10 runs:

The average latency decreased by a factor of about 3.38.
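
The relationship between these figures can be checked with a little arithmetic. The sketch below uses the three single-run timings printed earlier in this article and the 3.38 average factor; the function names are illustrative only.

```python
def latency_reduction(before, after):
    """Fraction of latency removed when going from `before` to `after`."""
    return (before - after) / before


def reduction_from_speedup(speedup):
    """Convert an 'N times faster' factor into a fractional reduction."""
    return 1.0 - 1.0 / speedup


# Single-run timings printed earlier in the article (seconds).
baseline = 2.56152906418
optimized_server = 1.64234706879
optimized_client = 0.58314920859

print("server only: %.1f%%" % (100 * latency_reduction(baseline, optimized_server)))
print("server + client: %.1f%%" % (100 * latency_reduction(baseline, optimized_client)))
# The 3.38x average speedup over 10 runs, expressed as a reduction.
print("3.38x speedup: %.1f%%" % (100 * reduction_from_speedup(3.38)))
```

The last line comes out at roughly 70%, which matches the figure in the title. The single-run numbers differ from the 10-run average, so they should be read as examples rather than as the benchmark result.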

Throughput Optimization

Tensorflow Serving can be configured to handle large volumes of data. Throughput optimization is usually done for “offline” batch processing, where tight latency bounds are not a strict requirement.

Server-Side Batch Processing

As stated in the documentation, server-side batching is natively supported in Tensorflow Serving.

The trade-off between latency and throughput is governed by the batching parameters. They let you reach the maximum throughput that the hardware accelerators are capable of.

To enable batching, set the --enable_batching and --batching_parameters_file flags. Parameters are set according to SessionBundleConfig. For CPU-based systems, set num_batch_threads to the number of available cores. For GPUs, see the appropriate parameters here.
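
A batching parameters file is a small protobuf text file. The sketch below renders one from Python and sets num_batch_threads to the core count, as suggested for CPU systems. The function name `batching_parameters` is hypothetical, and every default value other than the core count is a placeholder for illustration, not a tuned recommendation.

```python
import os


def batching_parameters(num_batch_threads=None, max_batch_size=32,
                        batch_timeout_micros=1000, max_enqueued_batches=100):
    """Render a batching parameters file in protobuf text format.

    All default values here are placeholders for illustration;
    num_batch_threads defaults to the number of cores on this host.
    """
    if num_batch_threads is None:
        num_batch_threads = os.cpu_count() or 1
    fields = [
        ("max_batch_size", max_batch_size),
        ("batch_timeout_micros", batch_timeout_micros),
        ("max_enqueued_batches", max_enqueued_batches),
        ("num_batch_threads", num_batch_threads),
    ]
    return "".join("%s { value: %d }\n" % field for field in fields)


print(batching_parameters(num_batch_threads=4), end="")
```

The resulting text would be written to a file that is mounted into the container and passed through --batching_parameters_file.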

Once a full batch has been assembled on the server side, the inference requests are merged into one large request (tensor) and sent to the Tensorflow session as a single combined request. This is where CPU/GPU parallelism really comes into play.
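
The idea of merging queued requests can be illustrated with a toy model in plain Python. This is a conceptual sketch only: it ignores timeouts, queue limits, and threading, and it is not how Tensorflow Serving implements batching. All names are made up for the example.

```python
def form_batches(requests, max_batch_size):
    """Group queued requests into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be >= 1")
    return [
        requests[start:start + max_batch_size]
        for start in range(0, len(requests), max_batch_size)
    ]


def run_batched(requests, max_batch_size, model):
    """Call `model` once per batch and return one result per request."""
    results = []
    for batch in form_batches(requests, max_batch_size):
        outputs = model(batch)  # one call processes the whole batch
        if len(outputs) != len(batch):
            raise ValueError("model must return one output per input")
        results.extend(outputs)
    return results


# Stand-in "model" that doubles each input and records its batch sizes.
calls = []
def toy_model(batch):
    calls.append(len(batch))
    return [2 * x for x in batch]

print(run_batched([1, 2, 3, 4, 5], max_batch_size=2, model=toy_model), calls)
```

Five requests with a maximum batch size of 2 result in three model calls instead of five, which is the source of the throughput gain.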

Some common use cases for Tensorflow batching:

  • Using asynchronous client requests to fill server-side batches
  • Speeding up batch processing by placing components of the model graph on the CPU/GPU
  • Serving requests for multiple models from a single server
  • Batching is highly recommended for "offline" processing of a large number of requests
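
The first item can be sketched with a thread pool that keeps several requests in flight at once, which gives the server a chance to fill its batches. This is a minimal sketch under stated assumptions: `predict_all` is a hypothetical helper, and `send_request` stands for whatever function performs one blocking prediction call in your client.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_all(send_request, payloads, max_in_flight=8):
    """Send payloads concurrently and return the responses in input order.

    `send_request` is a placeholder for a function that performs one
    blocking prediction call (for example a wrapper around a gRPC stub).
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send_request, payloads))


# Stand-in for a real network call so the sketch runs on its own.
def fake_send(payload):
    return {"input": payload, "class": len(payload)}

responses = predict_all(fake_send, ["a.jpg", "bb.jpg", "ccc.jpg"])
print(responses)
```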

Client-Side Batch Processing

Client-side batch processing groups several incoming requests into one.

Since the ResNet model expects input in NHWC format (the first dimension is the number of inputs), we can combine several input images into a single RPC request:

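batch = []
for jpeg in os.listdir(FLAGS.images_path):
  path = os.path.join(FLAGS.images_path, jpeg)
  img = cv2.imread(path).astype(np.float32)
  batch.append(img)
# stack the images along a new first (batch) dimension: N x H x W x C;
# all images must have the same height, width, and channel count
batch_np = np.array(batch).astype(np.float32)
dims = [tensor_shape_pb2.TensorShapeProto.Dim(size=dim) for dim in batch_np.shape]
t_shape = tensor_shape_pb2.TensorShapeProto(dim=dims)
tensor = tensor_pb2.TensorProto(
  dtype=types_pb2.DT_FLOAT,
  tensor_shape=t_shape,
  float_val=list(batch_np.reshape(-1)))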

For a batch of N images, the output tensor in the response contains prediction results for the same number of inputs. In our case, N = 2:

outputs {  
  key: "classes"
  value {
    dtype: DT_INT64
    tensor_shape {
      dim {
        size: 2
    int64_val: 238
    int64_val: 121
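
Because the results come back in the same order as the inputs along the batch dimension, each class ID can be matched to its source file by position. The helper below is a hypothetical sketch; the file names are invented, and the class IDs are the two values shown above.

```python
def label_batch(file_names, class_ids):
    """Pair each input file with the class ID at the same batch position."""
    if len(file_names) != len(class_ids):
        raise ValueError("expected one class ID per input")
    return list(zip(file_names, class_ids))


# Hypothetical file names; the class IDs are the two values shown above.
print(label_batch(["first.jpg", "second.jpg"], [238, 121]))
```

Note that os.listdir returns entries in arbitrary order, so the list of file names should be captured once and reused both for building the batch and for labeling the results.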

Hardware Acceleration

A few words about GPUs.

The training process naturally exploits GPU parallelism, since building deep neural networks requires massive computation to reach an optimal solution.

For inference, however, the benefit of parallelization is less obvious. A GPU can often speed up neural network inference, but the hardware must be chosen and tested carefully, with an in-depth technical and economic analysis. Hardware parallelization is most valuable for batch processing of "offline" inference (massive volumes).

Before moving to a GPU, weigh the business requirements against a careful analysis of the costs (monetary, operational, technical) and the expected benefit (lower latency, higher throughput).
