
Deep learning and Caffe on New Year's holidays
Motivation
In this article, you will learn how to use deep learning in practice: the Caffe framework applied to the SVHN dataset.
Deep learning. This buzzword has been ringing in my ears for a long time, but I never had a chance to try it in practice. A good opportunity to fix that turned up: over the New Year holidays, a Kaggle contest on recognizing house numbers was organized as part of an image analysis course.
Participants were given a part of the well-known SVHN dataset: 73257 images in the training set and 26032 in the test (unlabeled) set. There are only 10 classes, one per digit. The images are 32x32 in the RGB color space. As the benchmarks show, methods based on deep learning achieve accuracy higher than that of a human: 1.92% error versus 2%!
I already had experience with machine learning algorithms such as SVM and Naive Bayes. Applying familiar methods is boring, so I decided to use something from deep learning, namely a convolutional neural network.
Choosing Caffe
There are many different libraries and frameworks for working with deep neural networks. My criteria were as follows:
- tutorials
- ease of development
- ease of deployment
- active community.
Caffe fit these criteria perfectly:
- Good tutorials are available on the project site. I would separately recommend the lectures from the Caffe Summer Bootcamp. For a quick start, you can read about the foundations of neural networks and then about Caffe.
- You don't even need to know a programming language to get started with Caffe: it is configured with configuration files and launched from the command line.
- For deployment there are a Chef cookbook and Docker images.
- Active development is underway on GitHub, and in the Google group you can ask questions about using the framework.
In addition, Caffe is very fast because it uses the GPU (although you can get by with a CPU).
Installation
Initially, I installed Caffe on my laptop using Docker and ran it in CPU mode. Training was very slow, but I had nothing to compare it with, so it seemed normal.
Then I stumbled upon a $25 Amazon coupon and decided to try an AWS g2.2xlarge instance with an NVIDIA GPU and CUDA support. I deployed Caffe there with Chef. The result was about 41 times faster: 100 iterations took 290 seconds on the CPU versus 7 seconds on the GPU with CUDA!
Neural network architecture
While classical machine learning algorithms require you to engineer a good feature vector to get acceptable quality, convolutional neural networks do not. The main thing is to come up with a good network architecture.
We introduce the following notation:
- input - input layer, usually the image pixels,
- conv - convolution layer [1],
- pool - subsampling layer [2],
- fully-conn - fully connected layer [3],
- output - output layer, produces the predicted image class.
The basic neural network architecture for image classification looks like this:
input -> conv -> pool -> conv -> pool -> fully-conn -> fully-conn -> output
The number of (conv -> pool) blocks can vary, but is usually at least two; the number of fully-conn layers is at least one. Several architectures were tried in this contest. I got the best accuracy with the following one:
input -> conv -> pool -> conv -> pool -> conv -> pool -> fully-conn -> fully-conn -> output
Implementing the architecture in Caffe
Caffe is configured with Protobuf files. The architecture implementation for the contest is here. Let's consider the key points in the configuration of each layer.
Input layer
Input Layer Configuration
name: "WinnyNet-F"
layers {
name: "svhn-rgb"
type: IMAGE_DATA
top: "data"
top: "label"
image_data_param {
source: "/home/deploy/opt/SVHN/train-rgb-b.txt"
batch_size: 128
shuffle: true
}
transform_param {
mean_file: "/home/deploy/opt/SVHN/svhn/winny_net5/mean.binaryproto"
}
include: { phase: TRAIN }
}
layers {
name: "svhn-rgb"
type: IMAGE_DATA
top: "data"
top: "label"
image_data_param {
source: "/home/deploy/opt/SVHN/test-rgb-b.txt"
batch_size: 120
}
transform_param {
mean_file: "/home/deploy/opt/SVHN/svhn/winny_net5/mean.binaryproto"
}
include: { phase: TEST }
}
...
The first two layers (for the training and test phases) have type: IMAGE_DATA, i.e. the network takes images as input. The images are listed in a text file, where the first column is the path to an image and the second column is its class. The path to this text file is specified in the image_data_param attribute.
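For illustration, here is a minimal Python sketch of how such a list file could be generated (the directory layout with one sub-directory per digit is an assumption, not the actual contest data):

import os

# Assumed layout: /home/deploy/opt/SVHN/train/<digit>/<image>.png
root = "/home/deploy/opt/SVHN/train"

with open("/home/deploy/opt/SVHN/train-rgb-b.txt", "w") as listing:
    for label in sorted(os.listdir(root)):                  # "0" .. "9"
        class_dir = os.path.join(root, label)
        for name in sorted(os.listdir(class_dir)):
            # One line per image: <path to image> <class>
            listing.write("%s %d\n" % (os.path.join(class_dir, name), int(label)))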
Besides images, the input data can come from HDF5, LevelDB and LMDB. The last two options are especially relevant if speed is critical. Thus, Caffe can work with any data, not just images. IMAGE_DATA is the easiest to work with, which is why it was chosen for the contest.
Input layers may also include the transform_param attribute. It describes the transformations the input data should undergo. Before feeding images to a neural network, they are usually normalized, or trickier operations are applied, for example Local Contrast Normalization. In this case mean_file was specified: subtraction of the "mean" image from the input.
Caffe uses mini-batch gradient descent. The input layer contains the batch_size parameter: in one iteration, batch_size samples are fed to the input of the neural network.
Convolution and subsampling layers (conv, pool)
Convolution and subsampling layer configuration
...
layers {
  bottom: "data"
  top: "conv1/5x5_s1"
  name: "conv1/5x5_s1"
  type: CONVOLUTION
  blobs_lr: 1
  blobs_lr: 2
  convolution_param {
    num_output: 64
    kernel_size: 5
    stride: 1
    pad: 2
    weight_filler {
      type: "xavier"
      std: 0.0001
    }
  }
}
layers {
  bottom: "conv1/5x5_s1"
  top: "conv1/5x5_s1"
  name: "conv1/relu_5x5"
  type: RELU
}
layers {
  bottom: "conv1/5x5_s1"
  top: "pool1/3x3_s2"
  name: "pool1/3x3_s2"
  type: POOLING
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layers {
  bottom: "pool1/3x3_s2"
  top: "conv2/5x5_s1"
  name: "conv2/5x5_s1"
  type: CONVOLUTION
  blobs_lr: 1
  blobs_lr: 2
  convolution_param {
    num_output: 64
    kernel_size: 5
    stride: 1
    pad: 2
    weight_filler {
      type: "xavier"
      std: 0.01
    }
  }
}
layers {
  bottom: "conv2/5x5_s1"
  top: "conv2/5x5_s1"
  name: "conv2/relu_5x5"
  type: RELU
}
layers {
  bottom: "conv2/5x5_s1"
  top: "pool2/3x3_s2"
  name: "pool2/3x3_s2"
  type: POOLING
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layers {
  bottom: "pool2/3x3_s2"
  top: "conv3/5x5_s1"
  name: "conv3/5x5_s1"
  type: CONVOLUTION
  blobs_lr: 1
  blobs_lr: 2
  convolution_param {
    num_output: 128
    kernel_size: 5
    stride: 1
    pad: 2
    weight_filler {
      type: "xavier"
      std: 0.01
    }
  }
}
layers {
  bottom: "conv3/5x5_s1"
  top: "conv3/5x5_s1"
  name: "conv3/relu_5x5"
  type: RELU
}
layers {
  bottom: "conv3/5x5_s1"
  top: "pool3/3x3_s2"
  name: "pool3/3x3_s2"
  type: POOLING
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
...
The 3rd layer is a convolution layer with type: CONVOLUTION. It is followed by the activation function with type: RELU. The 4th layer is a subsampling layer with type: POOLING. Then the conv and pool layers are repeated twice more, but with different parameters.
The selection of parameters for these layers is empirical.
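To see how these parameters shape the network, here is a small Python sketch that traces the spatial size of the feature maps through the three conv -> pool blocks (the convolution formula is standard; the rounding up in pool_out reflects how Caffe computes pooled sizes, to the best of my knowledge):

import math

def conv_out(size, kernel, stride=1, pad=0):
    # Standard convolution output size.
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride, pad=0):
    # Caffe's pooling rounds up (assumption based on its source code).
    return int(math.ceil((size + 2 * pad - kernel) / float(stride))) + 1

size = 32                                              # SVHN images are 32x32
for block in range(3):                                 # three conv -> pool blocks
    size = conv_out(size, kernel=5, stride=1, pad=2)   # 32, 16, 8 (padding keeps the size)
    size = pool_out(size, kernel=3, stride=2)          # 16, 8, 4
print(size)   # 4: the last pool layer outputs 128 feature maps of 4x4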
Fully-connected and output layers (fully-conn, output)
Fully connected and output layer configuration
...
layers {
  bottom: "pool3/3x3_s2"
  top: "ip1/3072"
  name: "ip1/3072"
  type: INNER_PRODUCT
  blobs_lr: 1
  blobs_lr: 2
  inner_product_param {
    num_output: 3072
    weight_filler {
      type: "gaussian"
      std: 0.001
    }
    bias_filler {
      type: "constant"
    }
  }
}
layers {
  bottom: "ip1/3072"
  top: "ip1/3072"
  name: "ip1/relu_5x5"
  type: RELU
}
layers {
  bottom: "ip1/3072"
  top: "ip2/2048"
  name: "ip2/2048"
  type: INNER_PRODUCT
  blobs_lr: 1
  blobs_lr: 2
  inner_product_param {
    num_output: 2048
    weight_filler {
      type: "xavier"
      std: 0.001
    }
    bias_filler {
      type: "constant"
    }
  }
}
layers {
  bottom: "ip2/2048"
  top: "ip2/2048"
  name: "ip2/relu_5x5"
  type: RELU
}
layers {
  bottom: "ip2/2048"
  top: "ip3/10"
  name: "ip3/10"
  type: INNER_PRODUCT
  blobs_lr: 1
  blobs_lr: 2
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
      std: 0.1
    }
  }
}
layers {
  name: "accuracy"
  type: ACCURACY
  bottom: "ip3/10"
  bottom: "label"
  top: "accuracy"
  include: { phase: TEST }
}
layers {
  name: "loss"
  type: SOFTMAX_LOSS
  bottom: "ip3/10"
  bottom: "label"
  top: "loss"
}
A fully connected layer has type: INNER_PRODUCT. The output layer is connected to a loss layer (type: SOFTMAX_LOSS) and an accuracy layer (type: ACCURACY). The accuracy layer works only in the test phase and shows the percentage of correctly classified images in the validation sample.
It is important to set the weight_filler attribute correctly. If the initial weights are too large, the loss function can return NaN in the first iterations; in that case, reduce the std parameter of weight_filler.
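As a toy NumPy illustration (not Caffe code) of why this happens: overly large initial weights produce huge logits, and a naively computed softmax cross-entropy overflows to inf/NaN:

import numpy as np

def naive_softmax_loss(logits, label):
    # No numerical-stability tricks: np.exp overflows for large logits.
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return -np.log(probs[label])

x = np.random.randn(100)                     # a random 100-dimensional input
w_small = np.random.randn(100, 10) * 0.01    # small std -> finite loss, around log(10) ~ 2.3
w_large = np.random.randn(100, 10) * 100.0   # too large -> overflow

print(naive_softmax_loss(x.dot(w_small), label=3))   # ~2.3
print(naive_softmax_loss(x.dot(w_large), label=3))   # nan (with overflow warnings)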
Training Options
Training configuration
net: "/home/deploy/opt/SVHN/svhn/winny-f/winny_f_svhn.prototxt"
test_iter: 1
test_interval: 700
base_lr: 0.01
momentum: 0.9
weight_decay: 0.004
lr_policy: "inv"
gamma: 0.0001
power: 0.75
solver_type: NESTEROV
display: 100
max_iter: 77000
snapshot: 700
snapshot_prefix: "/mnt/home/deploy/opt/SVHN/svhn/snapshots/winny_net/winny-F"
solver_mode: GPU
To train the neural network well, you need to set the training parameters. In Caffe, they are specified in a separate protobuf configuration file. The configuration file for this contest is here. There are many parameters; we will consider some of them in more detail:
- net - the path to the network architecture configuration,
- test_interval - the number of iterations between tests of the network (phase: TEST),
- snapshot - the number of iterations between saves of the training state of the network.
Thanks to snapshots, training in Caffe can be paused and resumed.
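The lr_policy: "inv" schedule decays the learning rate as base_lr * (1 + gamma * iter)^(-power). A small Python check with the values from the configuration above reproduces the lr values printed in the training log below:

# lr_policy "inv": lr = base_lr * (1 + gamma * iter) ** (-power)
base_lr, gamma, power = 0.01, 0.0001, 0.75   # values from the solver configuration above

def inv_lr(iteration):
    return base_lr * (1.0 + gamma * iteration) ** (-power)

print(inv_lr(0))        # 0.01      -- "lr = 0.01" at iteration 0 in the log
print(inv_lr(77000))    # ~0.00197  -- "lr = 0.00197406" at iteration 77000
print(inv_lr(103600))   # ~0.00162  -- "lr = 0.00161609" at iteration 103600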
Training and testing
To start training the network, run the caffe train command with the configuration file where the training parameters are set:
> caffe train --solver=/home/deploy/winny-f/winny_f_svhn_solver.prototxt
Brief training log
.......................
I0109 18:12:17.035543 12864 solver.cpp:160] Solving WinnyNet-F
I0109 18:12:17.035578 12864 solver.cpp:247] Iteration 0, Testing net (#0)
I0109 18:12:17.077910 12864 solver.cpp:298] Test net output #0: accuracy = 0.0666667
I0109 18:12:17.077997 12864 solver.cpp:298] Test net output #1: loss = 2.3027 (* 1 = 2.3027 loss)
I0109 18:12:17.107712 12864 solver.cpp:191] Iteration 0, loss = 2.30359
I0109 18:12:17.107795 12864 solver.cpp:206] Train net output #0: loss = 2.30359 (* 1 = 2.30359 loss)
I0109 18:12:17.107817 12864 solver.cpp:516] Iteration 0, lr = 0.01
.......................
I0109 18:13:17.960325 12864 solver.cpp:247] Iteration 700, Testing net (#0)
I0109 18:13:18.045385 12864 solver.cpp:298] Test net output #0: accuracy = 0.841667
I0109 18:13:18.045462 12864 solver.cpp:298] Test net output #1: loss = 0.675567 (* 1 = 0.675567 loss)
I0109 18:13:18.072872 12864 solver.cpp:191] Iteration 700, loss = 0.383181
I0109 18:13:18.072949 12864 solver.cpp:206] Train net output #0: loss = 0.383181 (* 1 = 0.383181 loss)
.......................
I0109 20:08:50.567730 26450 solver.cpp:247] Iteration 77000, Testing net (#0)
I0109 20:08:50.610496 26450 solver.cpp:298] Test net output #0: accuracy = 0.916667
I0109 20:08:50.610571 26450 solver.cpp:298] Test net output #1: loss = 0.734139 (* 1 = 0.734139 loss)
I0109 20:08:50.640389 26450 solver.cpp:191] Iteration 77000, loss = 0.0050708
I0109 20:08:50.640470 26450 solver.cpp:206] Train net output #0: loss = 0.0050708 (* 1 = 0.0050708 loss)
I0109 20:08:50.640494 26450 solver.cpp:516] Iteration 77000, lr = 0.00197406
.......................
I0109 20:52:32.236827 30453 solver.cpp:247] Iteration 103600, Testing net (#0)
I0109 20:52:32.263108 30453 solver.cpp:298] Test net output #0: accuracy = 0.883333
I0109 20:52:32.263183 30453 solver.cpp:298] Test net output #1: loss = 0.901031 (* 1 = 0.901031 loss)
I0109 20:52:32.290550 30453 solver.cpp:191] Iteration 103600, loss = 0.00463345
I0109 20:52:32.290627 30453 solver.cpp:206] Train net output #0: loss = 0.00463345 (* 1 = 0.00463345 loss)
I0109 20:52:32.290644 30453 solver.cpp:516] Iteration 103600, lr = 0.00161609
One epoch is (73257 - 120) / 128 ≈ 571 iterations. After a little more than one epoch, at 700 iterations, the network's accuracy on the validation sample is 84%. At the 134th epoch accuracy is already 91%, and at the 181st epoch it is 88%. Perhaps training the network for more epochs, for example 1000, would let the accuracy stabilize at a higher level. In this contest, training was stopped at the 181st epoch.
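The same arithmetic as a small Python check against the iterations reported in the log above (the 120 validation images of the TEST batch are excluded, as in the calculation above):

train_images = 73257 - 120
iters_per_epoch = train_images / 128.0          # ~571.4 iterations per epoch

for iteration in (700, 77000, 103600):
    print(iteration, round(iteration / iters_per_epoch, 1))
# 700    ->   1.2 epochs
# 77000  -> 134.8 epochs
# 103600 -> 181.3 epochs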
In Caffe, training can be resumed from a snapshot by adding the --snapshot option:
> caffe train --solver=/home/deploy/winny-f/winny_f_svhn_solver.prototxt
--snapshot=winny_net/winny-F_snapshot_77000.solverstate
Testing on unlabeled images
To test the network, you must create a deploy configuration of the network architecture. Unlike the previous configuration, it has no accuracy layer, and the input layer is simplified.
The test set of 26032 images comes without labels. Therefore, to evaluate accuracy on the contest test set, you need to write some code. Caffe has interfaces for Python and Matlab.
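For illustration, a minimal pycaffe sketch of how a trained snapshot could be applied to the unlabeled test images; the file names are placeholders, and mean subtraction is omitted here for brevity:

import caffe

caffe.set_mode_gpu()

# Deploy prototxt plus weights from one of the snapshots (both paths are placeholders).
net = caffe.Classifier("winny_f_svhn_deploy.prototxt",
                       "winny-F_77000.caffemodel",
                       image_dims=(32, 32),
                       raw_scale=255,           # caffe.io.load_image returns floats in [0, 1]
                       channel_swap=(2, 1, 0))  # RGB -> BGR, as in most Caffe examples

predictions = []
for line in open("test-images.txt"):             # hypothetical list of test image paths
    image = caffe.io.load_image(line.strip())
    probs = net.predict([image], oversample=False)[0]   # 10 class probabilities
    predictions.append(probs.argmax())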
Thanks to snapshots, networks from different epochs can be tested. The network from the 134th epoch showed an accuracy (Private Score on Kaggle) of 88.7%, and the network from the 181st epoch 87.6%.
Ideas for improving accuracy
Judging by this master's thesis, the accuracy of the implemented architecture can reach 96%.
How can you try to increase the obtained accuracy of 88.7%?
- Train the network for more epochs. For example, in the deep learning tutorial on facial keypoints detection, the network was trained for 1000 epochs.
- Standardize the data so that the mean is 0 and the variance is 1. To do this, you need to store the data in HDF5, LevelDB or LMDB (a minimal sketch follows this list).
- Tune the training parameters. For example, reduce the learning rate every 100 epochs.
- You can also try dropout layers, but for this you would need to train the network for even more than 1000 epochs.
- The SVHN dataset contains an additional 600,000 labeled images. They are used in research, but using them in the contest would be unfair. Instead, new training data can be generated from the data already available.
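A minimal sketch of how standardized data could be packed into HDF5 for Caffe's HDF5 data layer; the dataset names "data" and "label" are what that layer expects, and the arrays here are placeholders for the real images:

import numpy as np
import h5py

# Placeholders for the real data: N x 3 x 32 x 32 images and N labels.
X = np.random.rand(1000, 3, 32, 32).astype(np.float32)
y = np.random.randint(0, 10, size=1000).astype(np.float32)

# Standardize: zero mean, unit variance.
X = (X - X.mean()) / X.std()

with h5py.File("svhn_train.h5", "w") as f:
    f.create_dataset("data", data=X)
    f.create_dataset("label", data=y)

# The HDF5 data layer's source file then just lists the path to svhn_train.h5.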
Conclusion
The implemented convolutional neural network achieved an accuracy of 88.9%. This is not the best result, but not bad for a first attempt. There is potential for increasing accuracy up to 96%.
Thanks to the Caffe framework, diving into deep learning is not very difficult: it is enough to create a couple of configuration files and start training with a single command. Of course, basic knowledge of artificial neural network theory is also needed. I have tried to provide it (in the form of links to materials), along with other information for a quick start, in this article.