Train a neural network written in TensorFlow in the cloud using Google Cloud ML and Cloud Shell

Original author: Viacheslav Kovalevskyi
  • Translation
  • Tutorial
In a previous article, we discussed how to train a chat bot based on a recurrent neural network on an AWS GPU instance. Today we will see how easy it is to train the same network using Google Cloud ML and Google Cloud Shell. Thanks to Google Cloud Shell, you will hardly need to do anything on your local computer! Although we take the network from the previous article as an example, you can just as easily take any other network that uses TensorFlow.


In lieu of a foreword


Special thanks to my patrons who made this article possible: Aleksandr Shepeliev, Sergei Ten, Alexey Polietaiev, Nikita Penzin, Andrey Karnaukhov, Matveev Evgeny, Anton Potemkin.

I have tried to make this article a self-contained guide, but I strongly advise you to follow each link to understand what is happening under the hood, rather than just copy-pasting the commands step by step.

Prerequisites


There is only one requirement the reader must satisfy in order to follow all the steps described in this article: a Google Cloud account with billing enabled, since we will be using paid functionality.

Let's start our journey with answers to two main questions:

  • What is Google Cloud ML?
  • What is Google Cloud Shell?

What is Google Cloud ML?


The official definition says the following:

Google Cloud Machine Learning brings the power and flexibility of TensorFlow to the cloud. You can use its components to select and extract features from your data, train your machine learning models, and get predictions using the managed resources of Google Cloud Platform.

I don't know about you, but this definition tells me little. I will try to explain what Google Cloud ML can do for you:

  • deploy your code to a cloud machine that has everything needed to train a TensorFlow model;
  • give that code access to a Google Cloud Storage bucket;
  • run your training code;
  • store the trained model in the cloud;
  • use the trained model to make predictions on new data.

The focus of this article will be on the first three points. In future articles, we will look at how to deploy a trained model to Google Cloud ML and how to make predictions with a cloud-hosted model.

What is Google Cloud Shell?


And again, the official definition:

Google Cloud Shell is a shell environment for managing resources hosted on Google Cloud Platform.

And again I will add a few details. Google Cloud Shell is:

  • a cloud instance provided to you (the machine type is not specified),
  • running Debian,
  • whose shell you can access through the web,
  • where you have everything you need to work with Google Cloud.

Yes, you read that correctly: you get a completely free instance with shell access, available right from your web console.


But nothing comes for free; in the case of Cloud Shell there are a few annoying restrictions - you can access it only through the web console, not over ssh (personally, I don't like using any terminal other than iTerm). I asked on StackOverflow whether it is possible to reach Cloud Shell over ssh and, alas, it is not. At least there is a way to make life easier by installing a special Chrome plug-in which, at a minimum, enables normal key bindings, so the terminal behaves like a terminal and not like a browser window (which it actually is =)).

More information on Cloud Shell features can be found here.

The steps that we have to go through:


  • Preparing the Cloud Shell environment for training
  • Cloud Storage preparation
  • Training data preparation
  • Training script preparation
  • Testing the training process locally
  • Training
  • Conversation with a bot

Preparing the Cloud Shell Environment for Training


It's time to open Cloud Shell. If you haven't done this before, it's very simple: open the console at console.cloud.google.com and click the Shell icon in the upper right corner.

In case of any problems, here is a short guide that describes how to start the console in detail.

All subsequent examples will be executed in Cloud Shell.

In addition, if this is your first time using Cloud ML from Cloud Shell, you need to prepare all the necessary dependencies. To do this, run just one line directly in the shell:

curl https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-samples/master/tools/setup_cloud_shell.sh | bash

It will install all the necessary packages.

If at this stage the process stumbles over the pillow installation:

Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-urImDr/olefile/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.r
ead().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" build_ext --disable-jpeg install --record /tmp/pip-GHGxvS-record/install-record.txt -
-single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-build-urImDr/olefile/

Then a manual installation will help:

pip install --user --upgrade pillow

Thanks to @Sp0tted_0wl for the tip. Next, you will need to update the PATH variable:

export PATH=${HOME}/.local/bin:${PATH}

If this is the first time you are using Cloud ML with the current project, you need to initialize the ML module. This can be done in one line:

➜ gcloud beta ml init-project
Cloud ML needs to add its service accounts to your project (ml-lab-123456) as Editors. This will enable Cloud Machine Learning to access resources in your project when running your training and prediction jobs.
Do you want to continue (Y/n)?  
Added serviceAccount:cloud-ml-service@ml-lab-123456-1234a.iam.gserviceaccount.com as an Editor to project 'ml-lab-123456'.

To check that everything is installed successfully, run one simple command:

➜ curl https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-samples/master/tools/check_environment.py | python  
...
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Your active configuration is: [cloudshell-12345]
Success! Your environment is configured 

Now it's time to decide which Google Cloud project you will use to train the network. I have a dedicated project for all my ML experiments. In any case, this is up to you, but here are the commands I use to switch between projects:

➜  gprojects
PROJECT_ID             NAME            PROJECT_NUMBER
ml-lab-123456          ml-lab          123456789012
...
➜  gproject ml-lab-123456
Updated property [core/project].

If you want to use the same magic, then you need to add the following to your .bashrc / .zshrc / other_rc file:

function gproject() {
  gcloud config set project "$1"
}
function gprojects() {
  gcloud projects list
}

Well, if you have made it this far, it means Cloud Shell is prepared and we have switched to the desired project, so we can proceed to the next step with a clear conscience.

Cloud Storage Preparation


First of all, why do we even need cloud storage? Since we will train the model in the cloud, the training process will not have access to the local file system of your machine. This means that all the necessary input data must be stored somewhere in the cloud. The same goes for the trained model: it also needs to be stored somewhere. That somewhere cannot be the machine on which the training runs, because you do not have access to it; and it cannot be your own machine, because the training process does not have access to that. It is a vicious circle, which can be broken by introducing a new link - cloud storage for the data.

Let's create a new cloud bucket that will be used for training:

➜ PROJECT_NAME=chatbot_generic
➜ TRAIN_BUCKET=gs://${PROJECT_NAME}
➜ gsutil mb ${TRAIN_BUCKET}
Creating gs://chatbot_generic/...

Here I have to mention something: if you look at the official guide, you will find the following warning there:

Warning: You must specify a region (like us-central1) for your bucket, not a multi-region location (like us).

However, if you follow this advice and create a regional bucket, the script will not be able to write anything to it 0_o (hold your fire - the bug has already been fixed).

In an ideal world where everything works, specifying a region is important, and it should match the region that will be used during training; otherwise it can negatively affect training speed.
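
If you do want to follow the official recommendation once the bug is fixed, gsutil lets you pin the bucket to a location at creation time. A minimal sketch, with us-central1 as an example region (pick the same region you will later pass to --region when submitting the job); had we created the bucket this way, the command would have been:

# -l sets the bucket location at creation time (example region)
gsutil mb -l us-central1 ${TRAIN_BUCKET}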

Now we are ready to prepare the input for the upcoming training.

Training data preparation


If you are using your own network, you can probably skip this part, or read only the piece that describes where the data needs to be uploaded.

This time (compared to the previous article) we will use a slightly modified version of the script that prepares the input data. I encourage you to read how the script works in its README file. For now, you can prepare the input in the following way (you can replace "td src" with "mkdir src; cd src"):

➜ td src
➜ ~/src$ git clone https://github.com/b0noI/dialog_converter.git
Cloning into 'dialog_converter'...
remote: Counting objects: 63, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 63 (delta 0), reused 0 (delta 0), pack-reused 59
Unpacking objects: 100% (63/63), done.
Checking connectivity... done.
➜ ~/src$ cd dialog_converter/
➜ ~/src/dialog_converter$ git checkout converter_that_produces_test_data_as_well_as_train_data
Branch converter_that_produces_test_data_as_well_as_train_data set up to track remote branch converter_that_produces_test_data_as_well_as_train_data from origin.
Switched to a new branch 'converter_that_produces_test_data_as_well_as_train_data'
➜ ~/src/dialog_converter$ python converter.py 
➜ ~/src/dialog_converter$ ls
converter.py  LICENSE  movie_lines.txt  README.md  test.a  test.b  train.a  train.b

Looking at the commands above, you might ask what "td" is. It is just short for "to dir", and it is one of the commands I use most often. To use this magic yourself, add the following to your rc file:

function td() {
  mkdir "$1"
  cd "$1"
}

This time we will improve the quality of our model by splitting the data into two sets: a training set and a test set. That is why we see four files instead of two, as we did last time.
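
Before uploading, you can give the split a quick sanity check (purely optional; this just uses standard shell tools on the files produced above). The .a/.b files are parallel, so the line counts within each pair should match:

# line counts within each train/test pair should be equal
wc -l train.a train.b test.a test.b
# peek at the first question/answer pair
head -n 1 train.a train.b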

Great, we finally have some data, let's upload it to the bucket:

➜ ~/src/dialog_converter$ gsutil cp test.* ${TRAIN_BUCKET}/input
Copying file://test.a [Content-Type=application/octet-stream]...
Copying file://test.b [Content-Type=chemical/x-molconn-Z]...                    
\ [2 files][  2.8 MiB/  2.8 MiB]      0.0 B/s                                   
Operation completed over 2 objects/2.8 MiB.                                      
➜ ~/src/dialog_converter$ gsutil cp train.* ${TRAIN_BUCKET}/input
Copying file://train.a [Content-Type=application/octet-stream]...
Copying file://train.b [Content-Type=chemical/x-molconn-Z]...                   - [2 files][ 11.0 MiB/ 11.0 MiB]                                                
Operation completed over 2 objects/11.0 MiB.                                     
➜ ~/src/dialog_converter$ gsutil ls ${TRAIN_BUCKET}
gs://chatbot_generic/input/
➜ ~/src/dialog_converter$ gsutil ls ${TRAIN_BUCKET}/input
gs://chatbot_generic/input/test.a
gs://chatbot_generic/input/test.b
gs://chatbot_generic/input/train.a
gs://chatbot_generic/input/train.b

Training script preparation


Now we can prepare the training script. We will use translate.py. However, its current implementation cannot be used with Cloud ML as-is, so a little refactoring is necessary. As usual, I created a feature request and prepared a branch with all the necessary changes. So, let's start by cloning it:

➜ ~/src/dialog_converter$ cd ..
➜ ~/src$ git clone https://github.com/b0noI/models.git
Cloning into 'models'...
remote: Counting objects: 1813, done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 1813 (delta 24), reused 0 (delta 0), pack-reused 1774
Receiving objects: 100% (1813/1813), 49.34 MiB | 39.19 MiB/s, done.
Resolving deltas: 100% (742/742), done.
Checking connectivity... done.
➜ ~/src$ cd models/
➜ ~/src/models$ git checkout translate_tutorial_supports_google_cloud_ml
Branch translate_tutorial_supports_google_cloud_ml set up to track remote branch translate_tutorial_supports_google_cloud_ml from origin.
Switched to a new branch 'translate_tutorial_supports_google_cloud_ml'
➜ ~/src/models$ cd tutorials/rnn/translate/

Please note that we are not using the master branch!

Testing the Training Process Locally


Since training in the cloud costs money, for testing you can simulate the training process locally. The problem is that actually training our network on the machine that runs Cloud Shell would, of course, grind it into the ground, and you would have to restart the instance without ever seeing a result. But do not worry, nothing is lost even in this case: fortunately, our script has a self-test mode that we can use. Here is how to use it:

➜ ~/src/models/tutorials/rnn/translate$ cd ..
➜ ~/src/models/tutorials/rnn$ gcloud beta ml local train \
>   --package-path=translate \
>   --module-name=translate.translate \
>   -- \
>   --self_test
Self-test for neural translation model.

Pay attention to the folder from which we execute the command!

It seems the self-test completed successfully. Let's talk about the flags we used here:

  • package-path - the path to the Python package that will be deployed to the remote machine to perform the training (its expected layout is sketched after this list);
  • module-name - the module inside that package that will be executed (here translate.translate);
  • "--" - everything that follows it is passed as arguments to your module;
  • self_test - tells the module to run a self-test instead of actual training.
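
For the package-path / module-name pair to work, the translate directory has to be importable as a Python package. Roughly, its layout looks like this (file names taken from the upstream translate tutorial; treat this as an approximation and check the actual branch if in doubt):

# approximate contents of the translate package;
# an __init__.py must be present for translate.translate to resolve as a module
ls translate/
# __init__.py  data_utils.py  seq2seq_model.py  translate.py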

Training


We have finally reached the most interesting part of the process, the one we started all this for. But one small detail remains: we need to prepare the bucket paths that will be used during training and set all the local variables:

➜ ~/src/models/tutorials/rnn$ INPUT_TRAIN_DATA_A=${TRAIN_BUCKET}/input/train.a
➜ ~/src/models/tutorials/rnn$ INPUT_TRAIN_DATA_B=${TRAIN_BUCKET}/input/train.b
➜ ~/src/models/tutorials/rnn$ INPUT_TEST_DATA_A=${TRAIN_BUCKET}/input/test.a
➜ ~/src/models/tutorials/rnn$ INPUT_TEST_DATA_B=${TRAIN_BUCKET}/input/test.b
➜ ~/src/models/tutorials/rnn$ JOB_NAME=${PROJECT_NAME}_$(date +%Y%m%d_%H%M%S)
➜ ~/src/models/tutorials/rnn$ echo ${JOB_NAME}
chatbot_generic_20161224_203332
➜ ~/src/models/tutorials/rnn$ TRAIN_PATH=${TRAIN_BUCKET}/${JOB_NAME}
➜ ~/src/models/tutorials/rnn$ echo ${TRAIN_PATH}
gs://chatbot_generic/chatbot_generic_20161224_203332

It is important to note here that the name of our remote job (JOB_NAME) must be unique every time we start training. Now let's change the current folder to translate (don't ask =)):

➜ ~/src/models/tutorials/rnn$ cd translate/

Now we are ready to begin training. Let's first write out the command (without executing it yet) and discuss its main flags:

gcloud beta ml jobs submit training ${JOB_NAME} \
  --package-path=. \
  --module-name=translate.translate \
  --staging-bucket="${TRAIN_BUCKET}" \
  --region=us-central1 \
  -- \
  --from_train_data=${INPUT_TRAIN_DATA_A} \
  --to_train_data=${INPUT_TRAIN_DATA_B} \
  --from_dev_data=${INPUT_TEST_DATA_A} \
  --to_dev_data=${INPUT_TEST_DATA_B} \
  --train_dir="${TRAIN_PATH}" \
  --data_dir="${TRAIN_PATH}" \
  --steps_per_checkpoint=5 \
  --from_vocab_size=45000 \
  --to_vocab_size=45000

First, let's discuss some of the new flags of the training command:

  • staging-bucket - the bucket to be used during deployment; it makes sense to use the same bucket as for training;
  • region - the region in which the training job will run.

Also let's touch on the new flags that will be passed to the script:

  • from_train_data / to_train_data - the former en_train_data / fr_train_data; details can be found in the previous article;
  • from_dev_data / to_dev_data - the same as from_train_data / to_train_data, but for the test (or "dev", as it is called in the script) data, which is used to evaluate the loss;
  • train_dir - the folder where the training results (checkpoints) will be saved;
  • steps_per_checkpoint - how many steps to perform between saves of intermediate results. 5 is far too small; I set it only to verify that training runs without problems, and later I will restart the process with a larger value (200, for example);
  • from_vocab_size / to_vocab_size - to understand these, read the previous article. There you will learn that the default value (40k) is smaller than the number of unique words in the dialogues, which is why we increased the vocabulary size this time (a rough way to check this yourself is sketched after this list).
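
If you want to roughly verify the vocabulary claim yourself, a naive whitespace split over the training files gives a ballpark count of unique tokens (this is only an approximation of the tokenizer the script actually uses):

# run from the dialog_converter directory where train.a / train.b live
cat train.a train.b | tr -s ' ' '\n' | sort -u | wc -l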

It seems everything is ready to start training, so let's get going (you will need a little patience, because the process takes some time)...

➜ ~/src/models/tutorials/rnn/translate$ gcloud beta ml jobs submit training ${JOB_NAME} \
>   --package-path=. \
>   --module-name=translate.translate \
>   --staging-bucket="${TRAIN_BUCKET}" \
>   --region=us-central1 \
>   -- \
>   --from_train_data=${INPUT_TRAIN_DATA_A} \
>   --to_train_data=${INPUT_TRAIN_DATA_B} \
>   --from_dev_data=${INPUT_TEST_DATA_A} \
>   --to_dev_data=${INPUT_TEST_DATA_B} \
>   --train_dir="${TRAIN_PATH}" \
>   --data_dir="${TRAIN_PATH}" \
>   --steps_per_checkpoint=5 \
>   --from_vocab_size=45000 \
>   --to_vocab_size=45000
INFO    2016-12-24 20:49:24 -0800       unknown_task            Validating job requirements...
INFO    2016-12-24 20:49:25 -0800       unknown_task            Job creation request has been successfully validated.
INFO    2016-12-24 20:49:26 -0800       unknown_task            Job chatbot_generic_20161224_203332 is queued.
INFO    2016-12-24 20:49:31 -0800       service         Waiting for job to be provisioned.
INFO    2016-12-24 20:49:36 -0800       service         Waiting for job to be provisioned.
...
INFO    2016-12-24 20:53:15 -0800       service         Waiting for job to be provisioned.
INFO    2016-12-24 20:53:20 -0800       service         Waiting for job to be provisioned.
INFO    2016-12-24 20:53:20 -0800       service         Waiting for TensorFlow to start.
...
INFO    2016-12-24 20:54:56 -0800       master-replica-0                Successfully installed translate-0.0.0
INFO    2016-12-24 20:54:56 -0800       master-replica-0                Running command: python -m translate.translate --from_train_data=gs://chatbot_generic/input/train.a --to_train_data=gs://chatbot_generic/input/train.b --from_dev_data=gs://chatbot_generic/input/test.a --to_dev_data=gs://chatbot_generic/input/test.b --train_dir=gs://chatbot_generic/chatbot_generic_20161224_203332 --steps_per_checkpoint=5 --from_vocab_size=45000 --to_vocab_size=45000
INFO    2016-12-24 20:56:21 -0800       master-replica-0                Creating vocabulary /tmp/vocab45000 from data gs://chatbot_generic/input/train.b
INFO    2016-12-24 20:56:21 -0800       master-replica-0                  processing line 100000
INFO    2016-12-24 20:56:21 -0800       master-replica-0                Tokenizing data in gs://chatbot_generic/input/train.b
INFO    2016-12-24 20:56:21 -0800       master-replica-0                  tokenizing line 100000
INFO    2016-12-24 20:56:21 -0800       master-replica-0                Tokenizing data in gs://chatbot_generic/input/train.a
INFO    2016-12-24 20:56:21 -0800       master-replica-0                  tokenizing line 100000
INFO    2016-12-24 20:56:21 -0800       master-replica-0                Tokenizing data in gs://chatbot_generic/input/test.b
INFO    2016-12-24 20:56:21 -0800       master-replica-0                Tokenizing data in gs://chatbot_generic/input/test.a
INFO    2016-12-24 20:56:21 -0800       master-replica-0                Creating 3 layers of 1024 units.
INFO    2016-12-24 20:56:21 -0800       master-replica-0                Created model with fresh parameters.
INFO    2016-12-24 20:56:21 -0800       master-replica-0                Reading development and training data (limit: 0).
INFO    2016-12-24 20:56:21 -0800       master-replica-0                  reading data line 100000

You can monitor the status of your training. To do this, simply open another tab in Cloud Shell (or another tmux window), set the necessary variables, and run the command:

➜ JOB_NAME=chatbot_generic_20161224_213143
➜ gcloud beta ml jobs describe ${JOB_NAME}
...
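
In the same tab you can also follow the job's logs and watch checkpoints appear in the bucket. A small sketch; the stream-logs and cancel subcommands are assumptions about your gcloud version, so fall back to the web console if they are missing:

# follow the job's logs (Ctrl+C stops the streaming, not the job itself)
gcloud beta ml jobs stream-logs ${JOB_NAME}
# with TRAIN_PATH set as above, list the checkpoints written so far
gsutil ls ${TRAIN_PATH}
# cancel the running job before resubmitting it with new flags
gcloud beta ml jobs cancel ${JOB_NAME}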

Now, if everything goes well, we can stop the job and restart it with more steps per checkpoint, for example 200, which is the default value. Remember that the job name must be unique, so regenerate JOB_NAME (and with it TRAIN_PATH) before resubmitting. The new command will look like this:

➜ ~/src/models/tutorials/rnn/translate$ gcloud beta ml jobs submit training ${JOB_NAME} \
>   --package-path=. \
>   --module-name=translate.translate \
>   --staging-bucket="${TRAIN_BUCKET}" \
>   --region=us-central1 \
>   -- \
>   --from_train_data=${INPUT_TRAIN_DATA_A} \
>   --to_train_data=${INPUT_TRAIN_DATA_B} \
>   --from_dev_data=${INPUT_TEST_DATA_A} \
>   --to_dev_data=${INPUT_TEST_DATA_B} \
>   --train_dir="${TRAIN_PATH}" \
>   --data_dir="${TRAIN_PATH}" \
>   --from_vocab_size=45000 \
>   --to_vocab_size=45000

Conversation with a bot


Probably the biggest advantage of using Cloud Storage to preserve intermediate model states during training is the ability to start talking to the bot without interrupting the training process.

Now, as an example, I will show how you can start chatting with the bot after just 1600 training iterations. This, by the way, is the only step performed on the local machine. I think the reasons are obvious =)

Here's how to do it:

mkdir ~/tmp-data
gsutil cp gs://chatbot_generic/chatbot_generic_20161224_232158/translate.ckpt-1600.meta ~/tmp-data
...
gsutil cp gs://chatbot_generic/chatbot_generic_20161224_232158/translate.ckpt-1600.index ~/tmp-data
...
gsutil cp gs://chatbot_generic/chatbot_generic_20161224_232158/translate.ckpt-1600.data-00000-of-00001 ~/tmp-data
...
gsutil cp gs://chatbot_generic/chatbot_generic_20161224_232158/checkpoint ~/tmp-data
TRAIN_PATH=...
python -m translate.translate \
  --data_dir="${TRAIN_PATH}" \
  --train_dir="${TRAIN_PATH}" \
  --from_vocab_size=45000 \
  --to_vocab_size=45000 \
  --decode
Reading model parameters from /Users/b0noi/tmp-data/translate.ckpt-1600
> Hi there
you ? . . . . . . . .
> What do you want?
i . . . . . . . . .
> yes, you
i ? . . . . . . . .
> hi
you ? . . . . . . . .
> who are you?
i . . . . . . . . .
> yes you!
what ? . . . . . . . .
> who are you?
i . . . . . . . . .
>
you ' . . . . . . . .

The TRAIN_PATH variable should point to the "tmp-data" folder, and the current directory should be "models/tutorials/rnn".
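
Concretely, on the local machine the setup might look roughly like this (the clone location ~/src/models is an assumption; adjust it to wherever you cloned the branch locally):

# hypothetical local paths - adjust to your own clone location
cd ~/src/models/tutorials/rnn    # the decode command must be run from models/tutorials/rnn
TRAIN_PATH=~/tmp-data            # points at the checkpoint files copied above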

As you can see, the chat bot is still far from perfect after just 1,600 steps. If you want to see how it communicates after 50 thousand iterations, I will refer you to the previous article once again, since the goal here is not to train an ideal chat bot, but to learn how to train any network in the cloud using Google Cloud ML.

Post factum


I hope this article has helped you learn the ins and outs of working with Cloud ML and Cloud Shell, and that you can now use them to train your own networks. I also hope you enjoyed reading it; if so, you can support me on my Patreon page and/or by liking the article and helping to spread it =)

If you notice any problems at any of the steps, please let me know so that I can fix them quickly.
