
Training TensorFlow models with Azure Machine Learning service - Tutorial
For deep neural network (DNN) training with TensorFlow, Azure Machine Learning provides a custom TensorFlow estimator class. The TensorFlow estimator in the Azure SDK (not to be confused with the tf.estimator.Estimator class) makes it easy to submit TensorFlow training jobs for single-node and distributed runs on Azure compute resources.

Single-node training
Training with the TensorFlow estimator is similar to using the base Estimator, so first read the how-to article on estimators and make sure you understand the concepts introduced there.
To submit a TensorFlow job, you must create a TensorFlow object. You should already have created a compute target object, compute_target.
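If you have not provisioned one yet, a compute target can be created roughly as follows. This is a minimal sketch, assuming an existing Workspace object ws; the cluster name and VM size are illustrative.

from azureml.core.compute import AmlCompute, ComputeTarget

# Provision a GPU cluster that autoscales between 0 and 4 nodes.
provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                            min_nodes=0,
                                                            max_nodes=4)
compute_target = ComputeTarget.create(ws, 'gpu-cluster', provisioning_config)
compute_target.wait_for_completion(show_output=True)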
from azureml.train.dnn import TensorFlow

script_params = {
    '--batch-size': 50,
    '--learning-rate': 0.01,
}

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train.py',
                    conda_packages=['scikit-learn'],
                    use_gpu=True)
Specify the following parameters in the TensorFlow constructor.
Parameter | Description |
---|---|
source_directory | A local directory that contains all the code needed for the training job. This folder is copied from the local computer to the remote compute resource. |
script_params | A dictionary of command-line arguments to pass to the training script entry_script, specified as <command-line argument, value> pairs. |
compute_target | The remote compute target that the training script will run on, in this case an Azure Machine Learning Compute (AmlCompute) cluster. |
entry_script | The path (relative to source_directory) of the training script to be executed on the remote compute resource. This file, and any additional files it depends on, should be located in this folder. |
conda_packages | A list of Python packages required by the training script, to be installed via conda. Here the training script uses sklearn to download the data, so you must specify this package for installation. The pip_packages constructor parameter can be used for any required pip packages. |
use_gpu | Set this flag to True to use the GPU for training. The default is False. |
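On the script side, the entries in script_params arrive as ordinary command-line arguments. The following is a minimal sketch of how train.py might read them; the argument names match the dictionary above, and everything else is illustrative.

# train.py (sketch)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--batch-size', type=int, default=50)
parser.add_argument('--learning-rate', type=float, default=0.01)
args = parser.parse_args()

# argparse maps '--batch-size' to args.batch_size and '--learning-rate' to args.learning_rate
batch_size, learning_rate = args.batch_size, args.learning_rate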
Since you are using the TensorFlow estimator, the container used for training will, by default, contain the TensorFlow package and the related dependencies required for training on CPUs and GPUs.
Then submit the TensorFlow job:
run = exp.submit(tf_est)
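Here exp is an azureml.core Experiment object under which the run is recorded. If you have not created one yet, a sketch follows; the experiment name is illustrative, and ws is assumed to be an existing Workspace.

from azureml.core import Experiment

exp = Experiment(workspace=ws, name='tf-mnist')  # 'tf-mnist' is a hypothetical name
run = exp.submit(tf_est)
run.wait_for_completion(show_output=True)        # stream logs until the run finishes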
Distributed training
The TensorFlow estimator also lets you train models at scale across CPU and GPU clusters of Azure virtual machines. You can run distributed TensorFlow training with just a few API calls, while the Azure Machine Learning service manages in the background the infrastructure and orchestration needed to carry out these workloads.
Azure Machine Learning service supports two methods of distributed training in TensorFlow:
- MPI-based distributed training using the Horovod framework
- Native distributed TensorFlow training using the parameter server method
Horovod
Horovod is an open-source framework for distributed training based on the ring-allreduce algorithm, developed by Uber.
To run distributed TensorFlow training using the Horovod framework, create the TensorFlow object as follows:
from azureml.train.dnn import TensorFlow

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    process_count_per_node=1,
                    distributed_backend='mpi',
                    use_gpu=True)
The code above introduces the following new parameters in the TensorFlow constructor.
Parameter | Description | Default value |
---|---|---|
node_count | The number of nodes to use for the training job. | 1 |
process_count_per_node | The number of processes (or workers) to run on each node. | 1 |
distributed_backend | The backend for launching distributed training, which the estimator offers via MPI. To carry out parallel or distributed training (for example, node_count > 1, process_count_per_node > 1, or both) with MPI (and Horovod), set distributed_backend='mpi'. Azure Machine Learning uses the Open MPI implementation of MPI. | None |
In the example above, distributed training will run with two workers, one worker per node.
Horovod and its dependencies are installed for you automatically, so you can simply import them in the training script train.py as follows:

import tensorflow as tf
import horovod.tensorflow as hvd
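Inside the training script, Horovod's usual setup steps still apply: initialize Horovod, scale the learning rate by the number of workers, and wrap the optimizer. The following is a minimal sketch of these Horovod-specific pieces; the optimizer and learning rate are illustrative.

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod (one process per worker, communicating over MPI).
hvd.init()

# Scale the learning rate by the number of workers and wrap the optimizer
# so that gradients are averaged across workers via ring-allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variable states from rank 0 to all other processes.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]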
Finally, submit your TensorFlow job:
run = exp.submit(tf_est)
Parameter Server
You can also run native distributed TensorFlow training, which uses the parameter server model. In this method, training runs on a cluster of parameter servers and workers. During training, the workers compute the gradients while the parameter servers aggregate them.
Create a TensorFlow object:
from azureml.train.dnn import TensorFlow

tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    worker_count=2,
                    parameter_server_count=1,
                    distributed_backend='ps',
                    use_gpu=True)
Note the following parameters in the TensorFlow constructor in the code above.
Parameter | Description | Default value |
---|---|---|
worker_count | The number of workers. | 1 |
parameter_server_count | The number of parameter servers. | 1 |
distributed_backend | The backend to use for distributed training. To carry out distributed training via the parameter server method, set distributed_backend='ps'. | None |
Note on TF_CONFIG
You also need the network addresses and ports of the cluster for tf.train.ClusterSpec, so the Azure Machine Learning service sets the TF_CONFIG environment variable for you.
The TF_CONFIG environment variable is a JSON string. Here is an example of the variable for a parameter server:
TF_CONFIG='{
    "cluster": {
        "ps": ["host0:2222", "host1:2222"],
        "worker": ["host2:2222", "host3:2222", "host4:2222"]
    },
    "task": {"type": "ps", "index": 0},
    "environment": "cloud"
}'
If you use TensorFlow's high-level tf.estimator API, TensorFlow parses this TF_CONFIG variable and builds the cluster spec for you.
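For example, with tf.estimator.train_and_evaluate, no manual cluster parsing is needed. In this sketch, model_fn, train_input_fn, and eval_input_fn are hypothetical placeholders for your model and input functions.

import tensorflow as tf

# tf.estimator reads TF_CONFIG from the environment by itself.
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='./outputs')
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)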
If you use a lower-level API for training, you need to parse the TF_CONFIG variable yourself and build the tf.train.ClusterSpec in your training code. In this example, these steps are performed in the training script as follows:
import os, json
import tensorflow as tf

# TF_CONFIG is set on each node by the Azure Machine Learning service.
tf_config = os.environ.get('TF_CONFIG')
if not tf_config or tf_config == "":
    raise ValueError("TF_CONFIG not found.")
tf_config_json = json.loads(tf_config)

# Build the cluster spec from the 'cluster' entry of TF_CONFIG.
cluster = tf_config_json.get('cluster')
cluster_spec = tf.train.ClusterSpec(cluster)
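From here, the task entry of TF_CONFIG tells each process its role, and a tf.train.Server can be started from the cluster spec. The following sketch continues the snippet above.

# Determine this process's role from the 'task' entry of TF_CONFIG.
task = tf_config_json['task']
server = tf.train.Server(cluster_spec,
                         job_name=task['type'],
                         task_index=task['index'])

if task['type'] == 'ps':
    server.join()  # parameter servers block and serve variables
# Workers would build the model graph and run training sessions against server.target.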
Once you have finished writing the training script and created the TensorFlow object, submit the training job:
run = exp.submit(tf_est)
Examples
For notebooks on distributed deep learning, see the GitHub repository.
Learn how to run notebooks by following the article on using Jupyter notebooks to explore this service.