ResNet50. My own implementation

    Hello. The neural network library was described in my previous article. Here I decided to show how you can use a network trained in TF (TensorFlow) in your own solution, and whether it is worth it.

    Under the cut: a comparison with the original TF implementation, a demo application for image recognition, and, well... conclusions. If you're interested, read on.

    You can read about how ResNet works, for example, here.

    This is what the network structure looks like in numbers:



    The code turned out to be neither simpler nor more complicated than the Python version.

    C++ code to create the network:
      auto net = sn::Net();

        // stem: 7x7 convolution + 3x3 max-pooling
        net.addNode("In", sn::Input(), "conv1")
           .addNode("conv1", sn::Convolution(64, 7, 3, 2, sn::batchNormType::beforeActive, sn::active::none, mode), "pool1_pad")
           .addNode("pool1_pad", sn::Pooling(3, 2, sn::poolType::max, mode), "res2a_branch1 res2a_branch2a");

        // stage 2: one projection (conv) block + two identity blocks
        convBlock(net, vector{ 64, 64, 256 }, 3, 1, "res2a_branch", "res2b_branch2a res2b_branchSum", mode);
        idntBlock(net, vector{ 64, 64, 256 }, 3, "res2b_branch", "res2c_branch2a res2c_branchSum", mode);
        idntBlock(net, vector{ 64, 64, 256 }, 3, "res2c_branch", "res3a_branch1 res3a_branch2a", mode);

        // stage 3: projection block + three identity blocks
        convBlock(net, vector{ 128, 128, 512 }, 3, 2, "res3a_branch", "res3b_branch2a res3b_branchSum", mode);
        idntBlock(net, vector{ 128, 128, 512 }, 3, "res3b_branch", "res3c_branch2a res3c_branchSum", mode);
        idntBlock(net, vector{ 128, 128, 512 }, 3, "res3c_branch", "res3d_branch2a res3d_branchSum", mode);
        idntBlock(net, vector{ 128, 128, 512 }, 3, "res3d_branch", "res4a_branch1 res4a_branch2a", mode);

        // stage 4: projection block + five identity blocks
        convBlock(net, vector{ 256, 256, 1024 }, 3, 2, "res4a_branch", "res4b_branch2a res4b_branchSum", mode);
        idntBlock(net, vector{ 256, 256, 1024 }, 3, "res4b_branch", "res4c_branch2a res4c_branchSum", mode);
        idntBlock(net, vector{ 256, 256, 1024 }, 3, "res4c_branch", "res4d_branch2a res4d_branchSum", mode);
        idntBlock(net, vector{ 256, 256, 1024 }, 3, "res4d_branch", "res4e_branch2a res4e_branchSum", mode);
        idntBlock(net, vector{ 256, 256, 1024 }, 3, "res4e_branch", "res4f_branch2a res4f_branchSum", mode);
        idntBlock(net, vector{ 256, 256, 1024 }, 3, "res4f_branch", "res5a_branch1 res5a_branch2a", mode);

        // stage 5: projection block + two identity blocks
        convBlock(net, vector{ 512, 512, 2048 }, 3, 2, "res5a_branch", "res5b_branch2a res5b_branchSum", mode);
        idntBlock(net, vector{ 512, 512, 2048 }, 3, "res5b_branch", "res5c_branch2a res5c_branchSum", mode);
        idntBlock(net, vector{ 512, 512, 2048 }, 3, "res5c_branch", "avg_pool", mode);

        // head: global average pooling, 1000-way FC, softmax + cross-entropy
        net.addNode("avg_pool", sn::Pooling(7, 7, sn::poolType::avg, mode), "fc1000")
           .addNode("fc1000", sn::FullyConnected(1000, sn::active::none, mode), "LS")
           .addNode("LS", sn::LossFunction(sn::lossType::softMaxToCrossEntropy), "Output");
    


    The full code is available here.

    You can do it more simply by loading the network architecture and weights from files, like this:
     string archPath = "c:/cpp/other/skyNet/example/resnet50/resNet50Struct.json",
            weightPath = "c:/cpp/other/skyNet/example/resnet50/resNet50Weights.dat";

        // read the architecture description (JSON) into a string
        std::ifstream ifs;
        ifs.open(archPath, std::ifstream::in);
        if (!ifs.good()){
            cout << "error open file : " + archPath << endl;
            system("pause");
            return false;
        }
        ifs.seekg(0, ifs.end);
        size_t length = ifs.tellg();
        ifs.seekg(0, ifs.beg);

        string jnArch; jnArch.resize(length);
        ifs.read((char*)jnArch.data(), length);

        // create the net: architecture from the JSON, weights from the binary file
        sn::Net snet(jnArch, weightPath);
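
    Once the net is created (either way), inference on an image boils down to a single forward call. A minimal sketch, assuming the image has already been decoded and normalized into a float buffer imgData; the Tensor and forward signatures here follow the library's examples and may differ slightly in your version:

        // one RGB image, 224 x 224; imgData: std::vector<float> with the pixels
        sn::Tensor inLayer(sn::snLSize(224, 224, 3, 1), imgData);
        sn::Tensor outLayer(sn::snLSize(1000, 1, 1, 1));   // 1000 class scores

        snet.forward(false, inLayer, outLayer);            // false == no training pass

        // index of the best-scoring class
        float* scores = outLayer.data();
        size_t best = std::max_element(scores, scores + 1000) - scores;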
    


    I made a demo application out of interest. You can download it here. The download is large because of the network weights. The sources are included, and you can use them as an example.

    The application was created only for this article and will not be supported, so it was not included in the project repository.



    Now, how it compares with TF.

    The figures are averaged over a run of 100 images. Machine: i5-2400, GF1050, Win7, MSVC12.

    The recognition result values match TF's up to the 3rd digit.

    Test code
                  CPU: time/img, ms   GPU: time/img, ms   CPU: RAM, Mb   GPU: RAM, Mb
    Skynet        410                 120                 600            1200
    Tensorflow    250                 25                  400            1400


    Frankly, the results are rather dismal, of course.

    For the CPU I decided not to use MKL-DNN and thought I could tune it myself: I laid out the memory for sequential reads and loaded the vector registers to the maximum. Perhaps I should have reduced the convolution to a matrix multiplication, and/or applied some other tricks. This is where I hit a wall; at first it was even worse. It would probably have been more correct to use MKL after all.
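
    For reference, "reducing to matrix multiplication" means the classic im2col trick: every receptive field of the input is unrolled into a column, so the whole convolution collapses into one big matrix product that the vector units (or a BLAS such as MKL) handle very efficiently. A minimal single-channel sketch of the idea, not the library's actual code:

        #include <vector>

        // Unroll k x k patches (stride 1, no padding) of a single-channel
        // H x W image into columns: one column of k*k values per output pixel.
        void im2col(const float* img, int H, int W, int k, std::vector<float>& cols){
            int outH = H - k + 1, outW = W - k + 1;
            cols.resize(size_t(k) * k * outH * outW);
            for (int ky = 0; ky < k; ++ky)
              for (int kx = 0; kx < k; ++kx)
                for (int y = 0; y < outH; ++y)
                  for (int x = 0; x < outW; ++x)
                    cols[((ky * k + kx) * outH + y) * outW + x] =
                        img[(y + ky) * W + (x + kx)];
        }

        // The convolution then becomes out(1 x outH*outW) =
        // weights(1 x k*k) * cols(k*k x outH*outW) -- a single GEMM per layer.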

    On the GPU, time is lost copying memory to and from the video card, since not all operations are performed on the GPU.

    Conclusions that can be drawn from all this fuss:

    - don't show off: use well-known, proven solutions that have already more or less matured. I once sat on mxnet myself and struggled with using it natively; more on that below;

    - don't try to use the native C interface of ML frameworks; use them in the language the developers focused on, that is, Python.

    An easy way to use ML functionality from your language is to run a service process in Python and send images to it over a socket: you get a separation of responsibilities and no heavy code on your side.
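
    For illustration, the client side of such a scheme can be as simple as a length-prefixed blob over TCP. A sketch using POSIX sockets (on Windows the same logic needs the usual Winsock boilerplate); the port and the plain-text reply are made-up assumptions, and the Python service on the other end just reads the blob, runs the model, and writes the label back:

        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <unistd.h>
        #include <cstdint>
        #include <string>
        #include <vector>

        // Send one encoded image to the ML service, receive the class label.
        bool classify(const std::vector<char>& img, std::string& label){
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            sockaddr_in addr{};
            addr.sin_family = AF_INET;
            addr.sin_port   = htons(5000);                    // arbitrary port
            inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
            if (connect(fd, (sockaddr*)&addr, sizeof(addr)) < 0) return false;

            uint32_t len = htonl(uint32_t(img.size()));       // length prefix
            send(fd, &len, sizeof(len), 0);
            send(fd, img.data(), img.size(), 0);

            char buf[256] = {};
            ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
            close(fd);
            if (n <= 0) return false;
            label.assign(buf, size_t(n));
            return true;
        }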

    That's about all. The article turned out short, but the conclusions, I think, are valuable, and they apply not only to ML.

    Thanks.

    PS:
    if anyone has the desire and energy to still try to catch up with TF, you are welcome!)

    PS2:
    I gave up too early. I took a smoke break, had another go, and everything worked out.
    For the CPU, reducing the convolution to matrix multiplication helped, as I thought it would.
    For the GPU, I moved all operations into a separate library so that nothing gets copied to the CPU and back; the only downside of this approach is that I had to rewrite (duplicate) all the operators. Some of them matched the CPU ones, but I did not merge them.
    In general, here is how it stands now:
                  CPU: time/img, ms   GPU: time/img, ms   CPU: RAM, Mb   GPU: RAM, Mb
    Skynet        195                 15                  600            800
    Tensorflow    250                 25                  400            1400

    That is, the inference, at least, turned out even faster than in TF.
    The test code has not changed.
