NVIDIA Jetson Nano: tests and first impressions

    Hi, Habr.

    Fairly recently, in 2019, NVIDIA announced a single-board computer that is compatible with the Raspberry Pi form factor and aimed at AI and compute-intensive workloads.

    Once it went on sale, I wanted to see how it works and what can be done with it. Standard benchmarks are not very interesting, so we will come up with our own; the source code for every test is given in the text. If you are interested in the results, read on.


    For starters, the technical specifications from the NVIDIA website:

    A few points stand out.

    The first is the 128-core GPU: the board can run GPU-oriented workloads such as CUDA code (supported and installed out of the box) or Tensorflow. The main processor has 4 cores and, as shown below, is quite good. The 4 GB of memory is shared between the CPU and the GPU.

    The second is compatibility with the Raspberry Pi. The board has a 40-pin connector with various interfaces (I2C, SPI, etc.) and a camera connector that is also compatible with the Raspberry Pi. Presumably a large number of existing accessories (screens, motor control boards, etc.) will work, although you may need an extension cable because the Jetson Nano is a different size.

    Third, the board has 2 video outputs, Gigabit Ethernet and USB 3.0, so the Jetson Nano as a whole is even a little more capable than the board it is modeled on. Power is 5 V and can be supplied either via Micro USB or through a separate connector, which is recommended for resource-intensive tasks. As with the Raspberry Pi, the software boots from an SD card, to which the image must first be written. In general, the board is conceptually quite similar to the Raspberry Pi, which was apparently NVIDIA's intention. There is no WiFi on the board, however, which is a definite drawback; those who need it will have to use a USB WiFi module.

    If you look closely, you can see that the device consists of two parts: the Jetson Nano module itself and a bottom board that carries the connectors. The two are joined through a connector.

    That is, the module can be detached and used separately, which can be convenient for embedded solutions.

    Speaking of price: the original price of the Jetson Nano in the USA is $99, and the price in Europe, with the local store mark-up, is about 130 euros (with discounts you can probably find it cheaper). I do not know how much the Nano costs in Russia.


    As mentioned above, setup is not much different from the Raspberry Pi: write the image to the SD card with Etcher or Win32DiskImager, boot into Linux and install the necessary libraries. An excellent step-by-step guide is here, and I followed it. Let's move straight on to the tests: we will run different programs on the Nano and see how they perform. For comparison I used three computers: my work laptop (Core i7-6500U 2.5GHz), a Raspberry Pi 3B+ and the Jetson Nano.

    CPU Test

    First, a screenshot of the lscpu command.

    Raspberry Pi 3B+:

    Jetson Nano:

    For the calculations, let's start with something simple that still takes processor time, for example computing the digits of Pi. I took a simple Python program from Stack Overflow.

    I don't know whether it is optimal, but that does not matter here: we are interested in the relative time.

    Source code:
    import time
    # Source: https://stackoverflow.com/questions/9004789/1000-digits-of-pi-in-python
    def make_pi():
        # Spigot algorithm: yields the decimal digits of Pi one at a time
        q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
        for j in range(10000):
            if 4 * q + r - t < m * t:
                # The next digit is settled: emit it and rescale the state
                yield m
                q, r, t, k, m, x = 10*q, 10*(r-m*t), t, k, (10*(3*q+r))//t - 10*m, x
            else:
                # Not enough precision yet: consume one more term of the series
                q, r, t, k, m, x = q*k, (2*q+r)*x, t*x, k+1, (q*(7*k+2)+r*x)//(t*x), x+2
    t1 = time.time()
    pi_array = []
    for i in make_pi():
        pi_array.append(str(i))
    # Insert the decimal point after the leading "3"
    pi_array = pi_array[:1] + ['.'] + pi_array[1:]
    pi_array_str = "".join(pi_array)
    print("PI:", pi_array_str)
    print("dT:", time.time() - t1)

    As expected, the program is not fast. Result for the Jetson Nano: 0.8 s.

    The Raspberry Pi 3B+ took noticeably longer: 3.06 s. The reference laptop completed the task in 0.27 s. In general, even without the GPU, the main processor in the Nano is quite good for its form factor. Those who wish can run the test on a Raspberry Pi 4; I do not have one at hand.

    Some readers will surely want to comment that Python is not the best choice for such calculations. I repeat: what matters here is comparing the times, not minimizing them. Clearly, there are programs that compute Pi much faster.
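
    Since only relative times matter, it can help to make each measurement a bit more repeatable. Below is a minimal sketch of a timing helper; it is not what I used for the numbers above (those come from a single time.time() measurement), and the names best_of and busy_work are made up for illustration.

```python
import time


def best_of(func, repeats=3):
    """Run func() several times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def busy_work():
    # Hypothetical placeholder workload; swap in the Pi digit loop to repeat the test.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print("best of 3: %.4f s" % best_of(busy_work))
```

    Taking the minimum over a few runs reduces the influence of background processes, which makes the comparison between boards a little fairer.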


    Let's move on to more interesting calculations on the GPU, for which we will of course use CUDA (the board is from NVIDIA, after all). The PyCUDA library needed some shamanism during installation: it could not find cuda.h. Running sudo env "PATH=$PATH" pip install pycuda helped; there may be another way (more options were discussed on the devtalk.nvidia.com forum).
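
    One plausible explanation for why passing PATH through sudo helps is that the build looks for the CUDA toolkit via nvcc, and sudo normally resets PATH. The sketch below checks what the current environment can see before installing. It is only a diagnostic under assumptions: find_cuda is a hypothetical helper, and /usr/local/cuda is assumed as the toolkit location, which may differ on your image.

```python
import os
import shutil


def find_cuda(cuda_home="/usr/local/cuda"):
    """Report whether nvcc is on PATH and whether cuda.h exists under cuda_home.

    cuda_home is an assumed default location, not something the article states.
    """
    nvcc = shutil.which("nvcc")  # None if nvcc is not on the current PATH
    header = os.path.join(cuda_home, "include", "cuda.h")
    return {
        "nvcc": nvcc,
        "cuda_h": header if os.path.isfile(header) else None,
    }


if __name__ == "__main__":
    for name, location in find_cuda().items():
        print("%s: %s" % (name, location or "not found"))
```

    Running it once as a normal user and once under sudo shows whether the two environments see the same toolkit.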

    For the test I took SimpleSpeedTest, a simple PyCUDA program that just computes sines in a loop. It does nothing useful, but it is good enough for measuring speed, and its code is simple and clear.

    Source code:
    # SimpleSpeedTest.py
    # https://wiki.tiker.net/PyCuda/Examples/SimpleSpeedTest
    import pycuda.driver as drv
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    import numpy
    import time
    blocks = 64
    block_size = 128
    nbr_values = blocks * block_size
    n_iter = 100000
    print("Calculating %d iterations" % (n_iter))
    # SourceModule SECTION
    # create two timers so we can speed-test each approach
    start = drv.Event()
    end = drv.Event()
    mod = SourceModule("""
    __global__ void gpusin(float *dest, float *a, int n_iter)
    {
      // One thread per array element
      const int i = blockDim.x*blockIdx.x + threadIdx.x;
      for(int n = 0; n < n_iter; n++) {
        a[i] = sin(a[i]);
      }
      dest[i] = a[i];
    }
    """)
    gpusin = mod.get_function("gpusin")
    # create an array of 1s
    a = numpy.ones(nbr_values).astype(numpy.float32)
    # create a destination array that will receive the result
    dest = numpy.zeros_like(a)
    start.record() # start timing
    gpusin(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1), block=(block_size,1,1) )
    end.record() # end timing
    # wait for the end event to complete, then calculate the run length
    end.synchronize()
    secs = start.time_till(end)*1e-3
    print("PyCUDA time and first three results:")
    print("%fs, %s" % (secs, str(dest[:3])))
    # use numpy to calculate the result on the CPU for reference
    a = numpy.ones(nbr_values).astype(numpy.float32)
    t1 = time.time()
    for i in range(n_iter):
        a = numpy.sin(a)
    print("CPU time and first three results:")
    print("%fs, %s" % (time.time() - t1, str(a[:3])))
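
    The kernel above gives each thread one array element, selected by blockDim.x*blockIdx.x + threadIdx.x. As a small illustration in plain Python (no GPU needed), the sketch below enumerates that index for the same grid of 64 blocks with 128 threads each and confirms that every one of the 8192 elements is covered exactly once. The function name global_indices is made up for this example.

```python
blocks = 64
block_size = 128


def global_indices(grid_dim, block_dim):
    """Enumerate blockDim.x * blockIdx.x + threadIdx.x for every thread in the grid."""
    return [
        block_dim * block_idx + thread_idx
        for block_idx in range(grid_dim)
        for thread_idx in range(block_dim)
    ]


indices = global_indices(blocks, block_size)
print(len(indices), indices[0], indices[-1])       # 8192 0 8191
print(indices == list(range(blocks * block_size)))  # True
```

    Because nbr_values is exactly blocks * block_size, the kernel needs no bounds check; with an array size that is not a multiple of the block size, such a check would be required.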

    As you can see, the calculation is done using the GPU through CUDA and using the CPU through numpy.

    Jetson Nano - 0.67 s GPU, 13.3 s CPU.
    Raspberry Pi 3B+ - 41.85 s CPU; no GPU data, since CUDA does not work on the RPi.
    Laptop - 0.05 s GPU, 3.08 s CPU.

    Everything is as expected. Calculations on the GPU are much faster than on the CPU (128 cores, after all), and the Raspberry Pi lags quite significantly. And of course the laptop's video card is still much faster than the GPU in the Jetson Nano, most likely because it has many more processing cores.
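
    To put numbers on "much faster", here is a short sketch that turns the timings reported above into ratios. The figures are copied from this article; the dictionary layout and variable names are just for illustration.

```python
# Timings in seconds, as reported in the article
pi_test = {"Jetson Nano": 0.8, "Raspberry Pi 3B+": 3.06, "Laptop": 0.27}
sine_test = {
    "Jetson Nano": {"gpu": 0.67, "cpu": 13.3},
    "Laptop": {"gpu": 0.05, "cpu": 3.08},
}

# How many times slower each machine is than the laptop in the Pi test
pi_vs_laptop = {name: t / pi_test["Laptop"] for name, t in pi_test.items()}

# GPU speedup over the same machine's CPU in the sine test
gpu_speedup = {name: t["cpu"] / t["gpu"] for name, t in sine_test.items()}

for name, value in pi_vs_laptop.items():
    print("Pi test, %s: %.1fx the laptop time" % (name, value))
for name, value in gpu_speedup.items():
    print("Sine test, %s: GPU is %.1fx faster than CPU" % (name, value))
```

    On these figures the Nano's GPU is roughly 20 times faster than its own CPU in the sine test, while the laptop's GPU is roughly 60 times faster than the laptop's CPU.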


    As you can see, the NVIDIA board turned out to be quite interesting and very capable. It is slightly larger and more expensive than the Raspberry Pi, but if you need more computing power in a compact size, it is well worth it. Of course, that is not always necessary: for sending a temperature reading to narodmon, for example, a Raspberry Pi Zero is enough, with a large margin. So the Jetson Nano does not claim to replace the Raspberry Pi and its clones, but for resource-intensive tasks it is very interesting (not only drones or mobile robots, but also, for example, a doorbell camera with face recognition).

    Not everything planned fit into one part. The second part will cover the AI side: tests of Keras / Tensorflow and tasks on image classification and recognition.
