Training TensorFlow models with Azure Machine Learning Service


For deep neural network (DNN) training with TensorFlow, Azure Machine Learning provides a custom TensorFlow class of the Estimator. The TensorFlow estimator in the Azure SDK (not to be confused with the tf.estimator.Estimator class) makes it easy to submit TensorFlow training jobs for single-node and distributed runs on Azure compute resources.




Single-node training


Training with the TensorFlow estimator is similar to using the base Estimator, so first read the how-to article and make sure you understand the concepts presented there.


To submit a TensorFlow job, create a TensorFlow object. You should already have created your target compute resource object, compute_target.


from azureml.train.dnn import TensorFlow
script_params = {
    '--batch-size': 50,
    '--learning-rate': 0.01,
}
tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train.py',
                    conda_packages=['scikit-learn'],
                    use_gpu=True)

Specify the following parameters in the TensorFlow constructor.

source_directory: A local directory that contains all the code needed for the training job. This folder is copied from the local computer to the remote compute resource.
script_params: A dictionary of command-line arguments to pass to the training script entry_script, as <command-line argument, value> pairs.
compute_target: The remote compute target that the training script runs on, in this case an Azure Machine Learning Compute (AmlCompute) cluster.
entry_script: The path (relative to source_directory) of the training script to run on the remote compute resource. This file, and any additional files it depends on, should be located in this folder.
conda_packages: A list of Python packages needed by the training script, to be installed via conda. Here the training script uses sklearn to download data, so specify this package for installation. The pip_packages constructor parameter can be used for any needed pip packages (see the sketch after this list).
use_gpu: Set this flag to True to use the GPU for training. Defaults to False.
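
For example, here is a minimal sketch showing pip_packages used alongside conda_packages; the pandas dependency is purely illustrative:


tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train.py',
                    conda_packages=['scikit-learn'],
                    pip_packages=['pandas'],  # hypothetical extra pip dependency
                    use_gpu=True)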

Because you're using the TensorFlow estimator, the container used for training contains, by default, the TensorFlow package and the related dependencies needed for training on CPU and GPU.


Then submit the TensorFlow job:


run = exp.submit(tf_est)
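
The exp object used here is an azureml Experiment. As a minimal sketch, assuming a workspace config file (config.json) is available locally and using an illustrative experiment name:


from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                      # load workspace details from config.json
exp = Experiment(workspace=ws, name='tf-train')   # 'tf-train' is an illustrative name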

Distributed training


The TensorFlow estimator also lets you train models at scale across CPU and GPU clusters of Azure VMs. Distributed TensorFlow training is delivered through a few API calls, while Azure Machine Learning manages, behind the scenes, the infrastructure and orchestration needed to carry out these workloads.


Azure Machine Learning supports two methods of distributed training in TensorFlow: Horovod and the parameter server method.



Horovod


Horovod is an open-source framework for distributed training based on the ring-allreduce algorithm, developed by Uber.


To run distributed TensorFlow training with Horovod, create the TensorFlow object as follows:


from azureml.train.dnn import TensorFlow
tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    process_count_per_node=1,
                    distributed_backend='mpi',
                    use_gpu=True)

The code above introduces the following new parameters of the TensorFlow constructor.

node_count: The number of nodes to use for the training job. Defaults to 1.
process_count_per_node: The number of processes (or workers) to run on each node. Defaults to 1.
distributed_backend: The backend for launching distributed training, which the estimator offers through MPI. To carry out parallel or distributed training (for example, node_count > 1, process_count_per_node > 1, or both) with MPI (and Horovod), set distributed_backend='mpi'. Azure Machine Learning uses the Open MPI implementation. Defaults to None.

In the example above, distributed training runs with two workers, one worker per node.


Horovod and its dependencies are installed for you automatically, so you can simply import them in your training script train.py as follows:


import tensorflow as tf
import horovod.tensorflow as hvd
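
For reference, here is a minimal sketch of the typical Horovod pattern inside train.py, assuming TensorFlow 1.x graph-mode training; the optimizer choice and learning rate are illustrative:


import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                        # one Horovod process per worker

# Pin each process to a single GPU (assumes one GPU per process).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers, then wrap the optimizer
# so that gradients are averaged across workers with ring-allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)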

Finally, submit your TensorFlow job:


run = exp.submit(tf_est)

Parameter server


You can also run native distributed TensorFlow, which uses the parameter server model. In this method, training runs on a cluster of parameter servers and workers. The workers calculate the gradients during training, while the parameter servers aggregate them.


Create a TensorFlow object:


from azureml.train.dnn import TensorFlow
tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    worker_count=2,
                    parameter_server_count=1,
                    distributed_backend='ps',
                    use_gpu=True)

Note the following parameters in the TensorFlow constructor in the code above.

worker_count: The number of workers. Defaults to 1.
parameter_server_count: The number of parameter servers. Defaults to 1.
distributed_backend: The backend to use for distributed training. To do distributed training via the parameter server method, set distributed_backend='ps'. Defaults to None.

Notes on TF_CONFIG


You'll also need the network addresses and ports of the cluster for tf.train.ClusterSpec, so Azure Machine Learning sets the TF_CONFIG environment variable for you automatically.


The TF_CONFIG environment variable is a JSON string. Here is an example of the variable for a parameter server:


TF_CONFIG='{
    "cluster": {
        "ps": ["host0:2222", "host1:2222"],
        "worker": ["host2:2222", "host3:2222", "host4:2222"],
    },
    "task": {"type": "ps", "index": 0},
    "environment": "cloud"
}'

If you use TensorFlow's high-level tf.estimator API, TensorFlow parses this TF_CONFIG variable and builds the cluster spec for you.
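
As a minimal sketch of this behavior, assuming the TensorFlow 1.x tf.estimator API:


import tensorflow as tf

# RunConfig picks up TF_CONFIG from the environment automatically,
# including the cluster spec and this process's task type and index.
run_config = tf.estimator.RunConfig()
print(run_config.cluster_spec)                  # populated from TF_CONFIG
print(run_config.task_type, run_config.task_id)


An estimator created with this config and driven through tf.estimator.train_and_evaluate then runs in distributed mode without any manual parsing.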


If you use one of the lower-level APIs for training, you need to parse the TF_CONFIG variable yourself and build the tf.train.ClusterSpec in your training code. In this example, those steps happen in the training script as follows:


import os, json
import tensorflow as tf

# Read the TF_CONFIG environment variable set by Azure Machine Learning.
tf_config = os.environ.get('TF_CONFIG')
if not tf_config or tf_config == "":
    raise ValueError("TF_CONFIG not found.")

# Parse the JSON string and build the cluster spec from its "cluster" field.
tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
cluster_spec = tf.train.ClusterSpec(cluster)
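
Continuing the example, here is a sketch of how the parsed values might be used to start this process's server with the low-level tf.train.Server API; this part is not in the original script:


# Identify this process's role from the "task" field of TF_CONFIG.
task = tf_config_json.get('task', {})
server = tf.train.Server(cluster_spec,
                         job_name=task.get('type'),
                         task_index=task.get('index'))
if task.get('type') == 'ps':
    server.join()   # parameter servers block here and serve variables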

After you finish writing the training script and creating the TensorFlow object, submit the training job:


run = exp.submit(tf_est)
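
Once submitted, you can stream the run's logs to the console; wait_for_completion blocks until the run finishes:


run.wait_for_completion(show_output=True)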

Examples


For distributed deep learning notebooks, see the corresponding section of the GitHub repository.



To learn how to run notebooks, follow the directions in the article on exploring this service with Jupyter notebooks.

