Depth cameras: a quiet revolution (when robots will see). Part 1




    Recently I described why robots will soon start thinking MUCH better (a post about hardware acceleration of neural networks). Today we'll see why robots will soon see much better too. In some situations, even better than a human.

    We will talk about depth cameras, which shoot video where each pixel stores not a color but the distance to the object at that point. Such cameras have existed for more than 20 years, but in recent years their development has accelerated many times over, and we can already talk about a revolution. A multi-pronged one at that: rapid development is happening in several directions at once.

    If you want to see what this looks like, along with a comparison of the different approaches and their current and future applications, welcome under the cut!

    So! We will go through the main directions in the development of depth cameras, that is, the different principles of measuring depth, with their pros and cons.

    Method 1: Structured Light Camera


    Let's start with one of the simplest, oldest, and relatively cheap methods of measuring depth: structured light. This method appeared essentially as soon as digital cameras did, i.e., over 40 years ago, and became much simpler a bit later, with the advent of digital projectors.

    The basic idea is extremely simple. We place a projector that casts, for example, horizontal (and then vertical) stripes, and next to it a camera that photographs those stripes, as shown in this figure:

    Source: Autodesk: Structured Light 3D Scanning 

    Since the camera and the projector are offset from each other, the stripes shift in proportion to the distance to the object. By measuring this displacement, we can calculate the distance:
    Source: http://www.vision-systems.com/
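    To make the geometry concrete: depth here comes from plain triangulation, with the camera-to-projector distance as the baseline. Here is a minimal sketch in Python with toy values for the baseline and focal length (a real setup would calibrate both):

```python
def depth_from_stripe_shift(shift_px: float,
                            baseline_m: float = 0.10,  # camera-to-projector distance, assumed
                            focal_px: float = 800.0    # focal length in pixels, assumed
                            ) -> float:
    """Triangulated distance to the object (m) from the observed stripe shift (px)."""
    if shift_px <= 0:
        raise ValueError("no shift observed: the point is effectively at infinity")
    return baseline_m * focal_px / shift_px

print(depth_from_stripe_shift(40.0))  # 2.0 m with these toy parameters
```

    The same inverse relationship also explains why accuracy degrades with distance: far away, a whole meter of depth changes the shift by only a fraction of a pixel.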

    In fact, with the cheapest projector (prices start at around 3,000 rubles) and a smartphone, you can measure the depth of a static scene in a dark room:

    Source: Autodesk: Structured Light 3D Scanning 

    Of course, you'll have to solve a whole bunch of problems (projector calibration, phone camera calibration, stripe shift recognition, and so on), but all of these tasks are quite manageable even for advanced high school students learning to program.

    This principle of measuring depth became widely known when, in 2010, Microsoft released the MS Kinect depth sensor for $150, which at the time was a revolutionary price.

    Source: Partially Occluded Object Reconstruction using Multiple Kinect Sensors

    Besides actually measuring depth with an IR projector and an IR camera, Kinect also shot regular RGB video, had four microphones with noise cancellation, and could adjust itself to a person's height by automatically tilting up or down. Data processing was built in as well, so the device delivered a ready depth map straight to the console:

    Source: Implementation of natural user interface buttons using Kinect

    In total, about 35 million devices were sold, making Kinect the first mass-market depth camera in history. Depth cameras certainly existed before, but they typically sold a few hundred units at most and cost at least an order of magnitude more. This was a revolution, and it brought major investment into the field.

    An important reason for the success was that at launch there were already quite a few Xbox 360 games that actively used Kinect as a sensor. The takeoff was swift:


    Moreover, Kinect even made it into the Guinness Book of Records as the fastest-selling gadget in history. True, Apple soon pushed Microsoft off that spot, but still: for a new experimental sensor, an add-on to the main device, to become the fastest-selling electronic device in history is simply an excellent achievement:


    When I give lectures, I like to ask the audience where all these millions of customers came from. Who were all these people?

    As a rule, no one guesses, but sometimes, especially if the audience is older and more experienced, they give the correct answer: sales were driven by American parents, who were delighted to see that their children could play on the console not by sitting on the couch, but by jumping around in front of the TV. It was a breakthrough!!! Millions of mothers and fathers rushed to order the device for their children.

    In general, when it comes to gesture recognition, people usually naively believe that data from a 2D camera is enough. After all, they've seen plenty of beautiful demos! Reality is much harsher. The accuracy of gesture recognition from a 2D video stream and from a depth camera differ by an order of magnitude. With a depth camera, or rather with an RGB camera combined with a depth camera (the combination is important), you can recognize gestures far more accurately and more cheaply (even in a dark room), and this is what brought success to the first mass-market depth camera.

    Kinect was covered extensively on Habr at the time, so here is just a brief recap of how it works.

    An infrared projector casts a pseudo-random pattern of dots into space, and their displacement determines the depth at each pixel:


    Source: Depth Sensing Planar Structures: Detection of Office Furniture Configurations 
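    In essence, the depth at a pixel is obtained by finding how far the local patch of the dot pattern has slid horizontally relative to a stored reference image of the same pattern on a flat surface at a known distance. The real Kinect pipeline is proprietary and far more elaborate; this numpy sketch only illustrates the principle:

```python
import numpy as np

def dot_pattern_shift(observed: np.ndarray, reference: np.ndarray,
                      y: int, x: int, patch: int = 9, max_shift: int = 64) -> int:
    """Horizontal displacement of the dot pattern around pixel (y, x)
    relative to the reference image, by minimizing the sum of absolute
    differences over a small window."""
    h = patch // 2
    win = observed[y - h:y + h + 1, x - h:x + h + 1].astype(np.float32)
    best_shift, best_cost = 0, float("inf")
    for s in range(max_shift):
        if x - s - h < 0:  # candidate window would fall off the image
            break
        cand = reference[y - h:y + h + 1, x - s - h:x - s + h + 1].astype(np.float32)
        cost = float(np.abs(win - cand).sum())
        if cost < best_cost:
            best_cost, best_shift = cost, s
    return best_shift  # converts to depth via the same baseline*focal/shift relation
```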

    The camera resolution is stated as 640x480, but the effective resolution is closer to 320x240 with fairly strong filtering, and on real examples the picture looks like this (i.e., pretty rough):

    Source: Partially Occluded Object Reconstruction using Multiple Kinect Sensors

    The “shadows” cast by objects are clearly visible, since the camera and the projector are quite far apart. You can also see that depth is predicted from the shifts of several projector dots at once. In addition, there is (hard) filtering over immediate neighbors, but the depth map is still quite noisy, especially at object borders. This produces quite noticeable noise on the surfaces of reconstructed objects, which has to be additionally, and non-trivially, smoothed:

    Source: J4K Java Library for the Microsoft's Kinect SDK
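    As an illustration of the simplest kind of such smoothing (not what the Kinect SDK actually does internally), here is an edge-preserving bilateral filter over the depth map with OpenCV; file names and filter parameters are illustrative guesses:

```python
import cv2
import numpy as np

# Assumed input: a 16-bit depth map in millimeters, e.g. saved from a Kinect.
depth = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)

# cv2.bilateralFilter accepts 8-bit or 32-bit float images, so convert first.
d32 = depth.astype(np.float32)

# Smooth surfaces while mostly preserving object borders:
# sigmaColor is in depth units (mm here), sigmaSpace in pixels.
smoothed = cv2.bilateralFilter(d32, d=7, sigmaColor=30.0, sigmaSpace=5.0)

cv2.imwrite("depth_smoothed.png", smoothed.astype(np.uint16))
```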

    And nevertheless: just $150 (today it's down to $69, though a better option costs closer to $200, of course) and you “see” depth! There are a great many production products built on it.

    By the way, in February of this year the new Azure Kinect was announced:

    Source: Microsoft announces Azure Kinect, available for pre-order now.

    Shipments to developers in the USA and China should begin on June 27, i.e., literally right now. Beyond noticeably better RGB resolution and better depth quality (they promise 1024x1024 at 15 FPS and 512x512 at 30 FPS with a ToF camera, and the higher quality is clearly visible in the demo), it declares out-of-the-box support for several devices working together, less washout in sunlight, and error under 1 cm at a distance of 4 meters and 1-2 mm at distances under 1 meter. All of which sounds extremely interesting, so we wait, we wait:

    Source: Introducing Azure Kinect DK

    The next mass-market product with a structured-light depth camera was not a game console, but... (drum roll) ... that's right, the iPhone X!

    Its Face ID technology is a typical depth camera with a typical infrared dot projector and an infrared camera (by the way, now you understand why they sit at the opposite edges of the notch, spaced as far apart as possible: that distance is the stereo base):


    The resolution of the depth map is even lower than Kinect's: about 150x200. It's clear that if you say “our resolution is about 150x200 pixels, or 0.03 megapixels,” people will reply briefly and succinctly: “That sucks!” But if you say “Dot projector: more than 30,000 invisible dots are projected onto your face,” people say: “Wow, 30 thousand invisible dots, cool!” Someone will even ask whether the invisible dots give you freckles. And the topic goes viral! So the second option was the farsighted advertising choice. The resolution is small for three reasons: first, miniaturization requirements; second, power consumption; third, price.

    Nevertheless, this is another structured-light depth camera that has shipped in the millions and has already been copied by other smartphone makers, for example (surprise, surprise!) Huawei (which overtook Apple in smartphone sales last year). Only Huawei puts the camera on the right and the projector on the left, but also, of course, at the edges of the notch:

    Source: Huawei Mate 20 Pro update lets users add a second face for face unlock

    Huawei claims 300,000 dots, i.e., 10 times more than Apple, plus the front camera is better, and the font is larger. Whether the 300 thousand is an exaggeration is hard to say, but Huawei demonstrates a very good 3D scan of objects with the front camera. Independent tests look scarier, but this is clearly the very beginning of the story, the infancy of miniature, energy-efficient depth cameras, and the cameras announced for the end of this year are already noticeably better in specs.

    It's also clear why this technology was adopted for face identification in phones. First, the detector can no longer be fooled by showing a photo of a face (or a video on a tablet). Second, a face changes greatly when the lighting changes, but its shape does not, which allows identifying a person more accurately together with the data from the RGB camera:

    Source: photo of the same face from TI materials

    Obviously, an infrared sensor has inherent problems. First, our relatively weak projector is easily drowned out by the sun, so these cameras don't work outdoors. Even in the shade, if a white building wall nearby is lit by the sun, you can have big problems with Face ID. Kinect's noise level also spikes even when the sun is merely behind clouds:
     
    Source: this and the following two pictures are from Basler AG materials

    Another big problem is reflections and re-reflections. Since infrared light gets reflected too, shooting a shiny stainless-steel kettle, a varnished table, or a glass lampshade with Kinect will be problematic:


    And finally, two cameras shooting the same object can interfere with each other. Interestingly, with structured light you can make the projector flicker and work out which dots are yours and which are not, but that is a separate and rather complicated story:


    Now you know how to break Face ID...

    However, for mobile devices, structured light looks like the most reasonable compromise today:

    Source: Smartphone Companies Scrambling to Match Apple 3D Camera Performance and Cost

    With structured light, a conventional sensor is cheap enough that using it is more than justified in most cases. This has spawned a large number of startups operating on the formula: cheap sensor + complex software = quite acceptable result.

    For example, our former graduate student Maxim Fedyukov, who has worked on 3D reconstruction since 2004, created Texel, whose main product is a platform with 4 Kinect cameras and software that turns a person into a potential monument in 30 seconds. Or a desktop figurine, for those who can afford it. Or, cheap and cheerful, you can send friends photos of your 3D model (for some reason the most popular use case). They now ship their platforms and software abroad, from the UK to Australia:

    Source: Creating a 3D model of a person in 30 seconds.

    I can't strike a beautiful ballerina pose, so I just gaze pensively at the fin of a shark swimming past:

    Source: author’s materials

    In general, the new kind of sensor has given rise to new art projects. This winter I saw a rather curious VR film shot with Kinect. Below is an interesting dance visualization, also done with Kinect (it seems 4 cameras were used); unlike the previous example, they didn't fight the noise but rather turned it into a fun stylistic feature:

    Source: A Dance Performance Captured With a Kinect Sensor and Visualized With 3D Software

    What trends can be observed in this area:
    • As you know, the digital sensors of modern cameras are sensitive to infrared radiation, so special blocking filters have to be used to keep infrared noise from spoiling the picture (there is even a genre of artistic infrared photography, including with the filter removed from the sensor). This means that huge amounts of money are being invested in making sensors smaller, higher-resolution, and cheaper, and those same sensors can be used as infrared ones (with a special filter).
    • Similarly, depth map processing algorithms are improving rapidly, including so-called cross-filtering methods, where data from an RGB sensor and noisy depth data together produce a very good depth video (see the sketch after this list). Neural network approaches, in turn, make it possible to dramatically speed up getting a good result.
    • All the top companies work in this area, especially smartphone manufacturers.
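    As promised above, here is a minimal sketch of cross-filtering: a joint bilateral filter that smooths noisy depth under the guidance of the aligned RGB frame. This is only one of many possible approaches, and file names and parameters are illustrative:

```python
import cv2
import numpy as np

# Assumed inputs: an RGB frame and a noisy depth map already aligned to it.
rgb = cv2.imread("frame.png").astype(np.float32)
depth = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED).astype(np.float32)

# Joint bilateral filtering (requires opencv-contrib-python): the RGB image
# acts as a "guide", so depth values are averaged only across pixels of
# similar color, and depth edges snap to color edges instead of blurring away.
filtered = cv2.ximgproc.jointBilateralFilter(
    rgb, depth, d=9, sigmaColor=25.0, sigmaSpace=7.0)
```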

    Consequently:
    • We can expect a dramatic increase in the resolution and accuracy of structured-light depth cameras over the next 5 years.
    • Energy consumption of mobile sensors will fall (though more slowly), which will make it easier to use next-generation sensors in smartphones, tablets, and other mobile devices.

    In any case, what we are seeing now is the infancy of the technology: the first mass-market products, on which both manufacturing and the use of a new, unusual data type (video with depth) are only just being worked out.

    Method 2: Time of Flight Camera


    The next way to obtain depth is more interesting. It is based on measuring the round-trip delay of light (ToF, Time-of-Flight). As you know, the speed of modern processors is high, while the speed of light, by comparison, is low. In one clock cycle of a 3 GHz processor, light manages to travel only 10 centimeters. In other words, 10 clock cycles per meter, which is a lot of time for anyone who has done low-level optimization. Accordingly, we set up a pulsed light source and a special camera:

    Source: The Basler Time-of-Flight (ToF) Camera 

    Essentially, we need to measure the delay with which the light returns to each point:

    Source: The Basler Time-of-Flight (ToF) Camera 
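    The arithmetic of this direct, pulsed variant is one line; all the difficulty is in measuring such tiny delays per pixel. A sketch with assumed numbers:

```python
C = 299_792_458.0  # speed of light, m/s

def distance_from_delay(delay_s: float) -> float:
    """The pulse travels to the object and back, so halve the round trip."""
    return C * delay_s / 2.0

# A wall 1.5 m away returns the pulse after roughly 10 nanoseconds:
print(distance_from_delay(10e-9))  # ~1.5 m
```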

    Alternatively, if we have several sensors with different charge accumulation times, then, knowing the time shift of each sensor relative to the source and the brightness of the emitted flash, we can calculate the shift and hence the distance to the object; the more sensors, the higher the accuracy:


    Source: Larry Li, “Time-of-Flight Camera - An Introduction”
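    A common textbook formulation of this continuous-wave scheme (it follows the Larry Li introduction cited above) recovers the phase of the modulated light from four samples Q1..Q4 taken at offsets of 0°, 90°, 180°, and 270°; an actual sensor may differ in detail:

```python
import math

C = 299_792_458.0  # speed of light, m/s

def cw_tof_depth(q1: float, q2: float, q3: float, q4: float,
                 mod_freq_hz: float = 20e6) -> float:
    """Per-pixel depth from four phase-shifted samples of the returned signal."""
    # Phase shift of the reflected modulation relative to the emitted one:
    phase = math.atan2(q3 - q4, q1 - q2) % (2 * math.pi)
    # One full phase wrap equals c / (2 * f) of distance, i.e. the
    # unambiguous range is 7.5 m at 20 MHz modulation.
    return C * phase / (4 * math.pi * mod_freq_hz)
```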

    The result is a camera with LED or, more rarely, laser (VCSEL) infrared illumination:

    Source: a very good description of ToF on allaboutcircuits.com

    The picture comes out at a fairly low resolution (after all, several sensors with different polling times have to be placed side by side), but potentially at a high FPS. The problems are mostly at object boundaries (as with all depth cameras), but without the “shadows” typical of structured light:

    Source: Basler AG video

    In particular, it was exactly this type of camera (ToF) that Google actively tested at one time in Project Tango, well presented in this video. The idea was simple: combine data from the gyroscope, accelerometer, RGB camera, and depth camera to build a three-dimensional model of the scene in front of the smartphone:

    Source: Google's Project Tango Is Now Sized for Smartphones 

    The project itself didn't take off (in my opinion, it was somewhat ahead of its time), but it laid important groundwork for the wave of interest in AR (augmented reality) and, accordingly, for developing sensors able to support it. All of its achievements have now flowed into Google's ARCore.

    In general, the ToF camera market grows by about 30% every 3 years, which is exponential growth in itself; few markets grow that fast:

    Source: Potential of Time-of-Flight Cameras & Market Penetration

    A serious market driver today is the rapid (and also exponential) growth of industrial robotics, for which ToF cameras are an ideal fit. For example, if your robot packs boxes, then detecting with an ordinary 2D camera that the cardboard is starting to jam is an extremely non-trivial task, whereas for a ToF camera it is trivial to “see” and handle, and very quickly. As a result, we are witnessing a boom in industrial ToF cameras:


    Naturally, this also spills over into home products with depth cameras, for example, a security camera with a night-vision unit and a ToF depth camera from Germany's PMD Technologies, which has been developing 3D cameras for more than 20 years:

    Source: 3D Time-of-Flight Depth Sensing Brings Magic to the New Lighthouse Smart Home Camera

    Remember the invisibility cloak that Harry Potter was hiding under?

    Source: Harry Potter's Invisibility Cloak Gets an Origin Story and May Soon Exist in Real Life

    I'm afraid the German camera would detect it in no time. And putting a screen with a picture in front of such a camera would be difficult too (this is no easily distracted security guard):

    Source: Fragment of the film Mission Impossible: Phantom Protocol

    It seems that fooling new CCTV cameras equipped with a ToF depth camera, one capable of shooting video like this in complete darkness, will take more than Hogwarts schoolroom magic:

    Pretending to be a wall, hiding behind a screen, or otherwise preventing a combined ToF+RGB camera from spotting a foreign object is becoming radically harder from a technical standpoint.

    Another mass-market, peaceful application of depth cameras is gesture recognition. In the near future we can expect TVs, set-top boxes, and robot vacuum cleaners that understand not only voice commands, like smart speakers, but also an offhand “clean up over there!” with a wave of the hand. The remote control for a smart TV will then finally become unnecessary, and science fiction will enter everyday life. As a result, what was fantasy in 2002 became experimental in 2013 and, finally, mass-produced in 2019 (and users won't even know there's a depth camera inside; what's the difference how the magic works?):

    Source: article , experiments and product

    The full range of applications is even wider, of course:

    Source: video of depth sensors from Terabee (by the way, what kind of mice are running across the floor in the 2nd and 3rd videos? See them? Just kidding, it's dust in the air: the price you pay for the sensor's small size and the light source's proximity to it)

    By the way, there are also plenty of cameras under the ceiling in the famous Amazon Go cashierless stores:

    Source: Inside Amazon's surveillance-powered, no-checkout convenience store

    And, as TechCrunch writes: “They're augmented by separate depth-sensing cameras (using a time-of-flight technique, or so I understood from Kumar) that blend into the background like the rest, all matte black.” That is, the miracle of determining which shelf the yogurt was taken from is provided, among other things, by mysterious matte-black ToF cameras (a good question: are they in the photo?):


    Unfortunately, direct information is often hard to find, but there is indirect evidence. For example, there was a company called Softkinetic, which had been developing ToF cameras since 2007. Eight years later it was bought by Sony (which, incidentally, is ready to conquer new markets under the Sony Depthsensing brand). And one of Softkinetic's top employees now works on, of all things, Amazon Go. What a coincidence! In a couple of years, once the technology matures and the main patents are filed, the details will most likely be revealed.

    Well, and as usual, the Chinese are on fire. Pico Zense, for example, introduced a very impressive lineup of ToF cameras at CES 2019, including models for outdoor use:

    They promise a revolution everywhere. Trucks will be loaded more densely thanks to automated loading, ATMs will become safer with a depth camera in each one, robot navigation will become simpler and more accurate, people (and, most importantly, children!) will be counted an order of magnitude better in a crowd, new fitness machines will appear that check exercise form without an instructor, and so on and so forth. Naturally, cheap new-generation Chinese depth cameras stand ready for all this magnificence. Just take them and build them in!

    Interestingly, the latest production Huawei P30 Pro has a ToF sensor next to its main cameras. In other words, the long-suffering Huawei is making front structured-light sensors better than Apple and, it seems, has been more successful than Google (whose Project Tango was closed) at introducing a ToF camera next to the main cameras:

    Source: review of new Huawei technologies by Ars Technica at the end of March 2019

    Details of its use, of course, were not disclosed, but besides speeding up focusing (important with three main cameras carrying different lenses), this sensor can be used to improve the quality of background blur in photos (simulating a shallow depth of field).
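    The idea is straightforward; here is a toy sketch of depth-driven background blur (real portrait modes do far more careful matting; thresholds and file names here are arbitrary):

```python
import cv2
import numpy as np

# Assumed inputs: a photo and a depth map in millimeters aligned to it.
photo = cv2.imread("photo.jpg").astype(np.float32)
depth = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED).astype(np.float32)

blurred = cv2.GaussianBlur(photo, (31, 31), 0)

# Soft mask: 0 for the subject (closer than ~1.5 m), ramping to 1 for the
# background over the next half meter, to avoid a hard cutout edge.
mask = np.clip((depth - 1500.0) / 500.0, 0.0, 1.0)[..., None]

portrait = (photo * (1.0 - mask) + blurred * mask).astype(np.uint8)
cv2.imwrite("portrait.jpg", portrait)
```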

    It is also obvious that the next generation of depth sensors next to the main cameras will be used in AR applications, raising AR accuracy from today's “cool, but often buggy” to a level that works at scale. And, obviously, in light of the Chinese successes, the big question is how willing Google will be to support revolutionary Chinese hardware in ARCore. Patent wars can significantly slow down a technology market. We will watch this dramatic story unfold literally over the next two years.

    Interim results


    About 25 years ago, when the first automatic doors appeared, I personally watched quite respectable gentlemen periodically speed up in front of them. Will it manage to open in time or not? It's big, heavy, glass! I observed roughly the same thing recently on a tour of an automated factory in China with a group of quite respectable professors. They would lag a bit behind the group to see what would happen if you stood in the path of a robot peacefully carrying parts and playing a quiet, pleasant melody along the way. I confess I couldn't resist either... You know, it stops! Sometimes smoothly, sometimes dead in its tracks. The depth sensors work!

    Source: Inside Huawei Technology's New Campus

    The hotel also had cleaning robots that looked something like this:

    They got bullied even harder than the factory robots. Not as cruelly as in the inhumane, in every sense, Bosstown Dynamics videos, of course. But I personally watched people stand in a robot's way: the robot would try to go around the person, the person would step over and block the path again... a kind of cat and mouse. It seems that when driverless cars first appear on the roads, they will get cut off more often than usual... Oh, people, people... Hmm... But I digress.

    Summarizing the key points:
    • Thanks to the different operating principle, the light source of a ToF camera can be placed as close to the sensor as possible (even under the same lens); moreover, many industrial models arrange the LEDs around the sensor. As a result, the “shadows” on the depth map shrink dramatically or disappear altogether, which simplifies working with geometrically complex objects: important for industrial robots.
    • Since the pulsed illumination usually remains infrared, all the infrared-camera drawbacks from the previous section still apply: washout in sunlight, difficulties when two cameras work side by side, etc. Then again, industrial robots often work indoors, and cameras with laser illumination are in development.
    • Alas, it is harder for ToF sensors to ride along on the general improvement of RGB camera sensors, so their development is slower, yet surprisingly steady, and there is a LOT of news about ToF adoption from all quarters (in smartphones alone, sensor integration has been announced by Samsung, Google Pixel, and Sony Xperia...).
    • The new Sony promises that 2 of the phone's 8 (!!!) cameras will be ToF depth cameras (!), i.e., there will be depth cameras on both sides of the phone:
      Source: Hexa-cam Sony phone gets camera specs revealed
    • As a result, we will see a lot of interesting things in this area within the coming year alone! And next year, up to 20% of new phones will have depth cameras (structured light + ToF). Considering that in 2017 Apple stood alone on the market in splendid isolation with its “30 thousand dots,” while now nobody ships fewer than 300 thousand, the topic has clearly taken off:

      Source: Limited Smartphone 3D Sensing Market Growth in 2019; Apple to be Key Promoter of Growth in 2020

    Do you still doubt the ongoing revolution? 

    This was the first part! A general comparison of the methods will come in the second.

    Coming in the next installments:
    • Method 3, the classic: depth from stereo;
    • Method 4, newfangled: depth from plenoptics;
    • Method 5, fast-growing: lidars, including Solid State Lidars;
    • Some problems of processing video with depth;
    • And finally, a brief comparison of all 5 methods and general conclusions.


    Carthage must be destroyed... All video will be three-dimensional by the end of the century!

    Stay tuned! (If I have enough time, I will cover the new cameras, including tests of the fresh Kinect, by the end of the year.)

    Part 2

    Acknowledgments
    I would like to cordially thank:
    • the Computer Graphics Laboratory of the CMC faculty of Lomonosov Moscow State University for its contribution to the development of computer graphics in Russia in general and to work with depth cameras in particular,
    • Microsoft, Apple, Huawei and Amazon for great depth camera-based products,
    • Texel for the development of high-tech Russian products with depth cameras,
    • personally Konstantin Kozhemyakov, who did a lot to make this article better and more visual,
    • and, finally, many thanks to Roman Kazantsev, Eugene Lyapustin, Egor Sklyarov, Maxim Fedyukov, Nikolai Oplachko and Ivan Molodetsky for a large number of sensible comments and corrections that made this text much better!

