Training TensorFlow models with Azure Machine Learning Service


For deep neural network (DNN) training with TensorFlow, Azure Machine Learning provides a custom TensorFlow class of the Estimator. The TensorFlow estimator in the Azure SDK (not to be confused with the tf.estimator.Estimator class) makes it easy to submit TensorFlow training jobs for single-node and distributed runs on Azure compute resources.




Single-node training


Training with the TensorFlow estimator is similar to using the base Estimator, so first read the how-to article and make sure you understand the concepts presented there.


To submit a TensorFlow job, create a TensorFlow object. You should already have created your target compute resource object, compute_target.


from azureml.train.dnn import TensorFlow
script_params = {
    '--batch-size': 50,
    '--learning-rate': 0.01,
}
tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train.py',
                    conda_packages=['scikit-learn'],
                    use_gpu=True)

Specify the following parameters in the TensorFlow constructor.

source_directory: A local directory that contains all the code needed for the training job. This folder is copied from the local computer to the remote compute resource.
script_params: A dictionary of command-line arguments to pass to the training script entry_script, as <command-line argument, value> pairs.
compute_target: The remote compute target that the training script runs on, in this case an Azure Machine Learning Compute (AmlCompute) cluster.
entry_script: The path (relative to source_directory) of the training script to run on the remote compute resource. This file, and any additional files it depends on, should be located in this folder.
conda_packages: A list of Python packages needed by the training script, to be installed via conda. Here the training script uses sklearn to download data, so specify this package for installation. The pip_packages constructor parameter can be used for any needed pip packages (see the sketch after this list).
use_gpu: Set this flag to True to use the GPU for training. Defaults to False.
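
For example, here is a minimal sketch showing pip_packages used alongside conda_packages; the pandas dependency is purely illustrative:


tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train.py',
                    conda_packages=['scikit-learn'],
                    pip_packages=['pandas'],  # hypothetical extra pip dependency
                    use_gpu=True)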

Because you're using the TensorFlow estimator, the container used for training contains, by default, the TensorFlow package and the related dependencies needed for training on CPU and GPU.


Then submit the TensorFlow job:


run = exp.submit(tf_est)
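
The exp object used here is an azureml Experiment. As a minimal sketch, assuming a workspace config file (config.json) is available locally and using an illustrative experiment name:


from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                      # load workspace details from config.json
exp = Experiment(workspace=ws, name='tf-train')   # 'tf-train' is an illustrative name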

Distributed training


The TensorFlow estimator also lets you train models at scale across CPU and GPU clusters of Azure VMs. Distributed TensorFlow training is delivered through a few API calls, while Azure Machine Learning manages, behind the scenes, the infrastructure and orchestration needed to carry out these workloads.


Azure Machine Learning supports two methods of distributed training in TensorFlow: Horovod and the parameter server method.



Horovod


Horovod is an open-source framework for distributed training based on the ring-allreduce algorithm, developed by Uber.


To run distributed TensorFlow training with Horovod, create the TensorFlow object as follows:


from azureml.train.dnn import TensorFlow
tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    process_count_per_node=1,
                    distributed_backend='mpi',
                    use_gpu=True)

The code above introduces the following new parameters of the TensorFlow constructor.

node_count: The number of nodes to use for the training job. Defaults to 1.
process_count_per_node: The number of processes (or workers) to run on each node. Defaults to 1.
distributed_backend: The backend for launching distributed training, which the estimator offers through MPI. To carry out parallel or distributed training (for example, node_count > 1, process_count_per_node > 1, or both) with MPI (and Horovod), set distributed_backend='mpi'. Azure Machine Learning uses the Open MPI implementation. Defaults to None.

In the example above, distributed training runs with two workers, one worker per node.


Horovod and its dependencies are installed for you automatically, so you can simply import them in your training script train.py as follows:


import tensorflow as tf
import horovod.tensorflow as hvd
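
For reference, here is a minimal sketch of the typical Horovod pattern inside train.py, assuming TensorFlow 1.x graph-mode training; the optimizer choice and learning rate are illustrative:


import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                        # one Horovod process per worker

# Pin each process to a single GPU (assumes one GPU per process).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers, then wrap the optimizer
# so that gradients are averaged across workers with ring-allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)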

Finally, submit your TensorFlow job:


run = exp.submit(tf_est)

Parameter server


You can also run native distributed TensorFlow, which uses the parameter server model. In this method, training runs on a cluster of parameter servers and workers. The workers calculate the gradients during training, while the parameter servers aggregate them.


Create a TensorFlow object:


from azureml.train.dnn import TensorFlow
tf_est = TensorFlow(source_directory='./my-tf-proj',
                    script_params={},
                    compute_target=compute_target,
                    entry_script='train.py',
                    node_count=2,
                    worker_count=2,
                    parameter_server_count=1,
                    distributed_backend='ps',
                    use_gpu=True)

Note the following parameters in the TensorFlow constructor in the code above.

worker_count: The number of workers. Defaults to 1.
parameter_server_count: The number of parameter servers. Defaults to 1.
distributed_backend: The backend to use for distributed training. To do distributed training via the parameter server method, set distributed_backend='ps'. Defaults to None.

Notes on TF_CONFIG


You'll also need the network addresses and ports of the cluster for tf.train.ClusterSpec, so Azure Machine Learning sets the TF_CONFIG environment variable for you automatically.


The TF_CONFIG environment variable is a JSON string. Here is an example of the variable for a parameter server:


TF_CONFIG='{
    "cluster": {
        "ps": ["host0:2222", "host1:2222"],
        "worker": ["host2:2222", "host3:2222", "host4:2222"],
    },
    "task": {"type": "ps", "index": 0},
    "environment": "cloud"
}'

If you use TensorFlow's high-level tf.estimator API, TensorFlow parses this TF_CONFIG variable and builds the cluster spec for you.
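
As a minimal sketch of this behavior, assuming the TensorFlow 1.x tf.estimator API:


import tensorflow as tf

# RunConfig picks up TF_CONFIG from the environment automatically,
# including the cluster spec and this process's task type and index.
run_config = tf.estimator.RunConfig()
print(run_config.cluster_spec)                  # populated from TF_CONFIG
print(run_config.task_type, run_config.task_id)


An estimator created with this config and driven through tf.estimator.train_and_evaluate then runs in distributed mode without any manual parsing.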


If you use one of the lower-level APIs for training, you need to parse the TF_CONFIG variable yourself and build the tf.train.ClusterSpec in your training code. In this example, those steps happen in the training script as follows:


import os, json
import tensorflow as tf

# Read the TF_CONFIG environment variable set by Azure Machine Learning.
tf_config = os.environ.get('TF_CONFIG')
if not tf_config or tf_config == "":
    raise ValueError("TF_CONFIG not found.")

# Parse the JSON string and build the cluster spec from its "cluster" field.
tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
cluster_spec = tf.train.ClusterSpec(cluster)
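
Continuing the example, here is a sketch of how the parsed values might be used to start this process's server with the low-level tf.train.Server API; this part is not in the original script:


# Identify this process's role from the "task" field of TF_CONFIG.
task = tf_config_json.get('task', {})
server = tf.train.Server(cluster_spec,
                         job_name=task.get('type'),
                         task_index=task.get('index'))
if task.get('type') == 'ps':
    server.join()   # parameter servers block here and serve variables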

After you finish writing the training script and creating the TensorFlow object, submit the training job:


run = exp.submit(tf_est)
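
Once submitted, you can stream the run's logs to the console; wait_for_completion blocks until the run finishes:


run.wait_for_completion(show_output=True)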

Examples


For distributed deep learning notebooks, see the corresponding section of the GitHub repository.



To learn how to run notebooks, follow the directions in the article on exploring this service with Jupyter notebooks.

