How We Won the Intel RealSense Hackathon

    A while ago I wrote an article on Habr about various technologies for obtaining 3D images from a single camera. I ended that article with the words: "I myself, however, still have not encountered any of these cameras, which is a pity and annoyance."
    And then, suddenly, less than a year later, Intel held a seminar and hackathon in Moscow on the new generation of its 3D cameras (Intel RealSense). Curiosity got the better of us: a colleague and I signed up for the event. As it turned out, not in vain: we won the hackathon and received the Developer version of the camera, which we are now putting through its paces.

    The article talks about two things:
    1. The camera: its pros and cons, what can be done with it, and which tasks it is not suited for.
    2. The concept we proposed at the hackathon, which won us first place.


    There is a lot to say about the camera. It turned out to be more interesting than expected, not in terms of image quality, but in terms of the built-in mathematics. The main points:

    First, instead of Time-of-Flight technology, there has been a return to structured light. I think the accuracy for the same price is higher. ToF has more potential, of course, but a high-quality ToF sensor is too expensive. The structured-light pattern is interesting: it changes over time:


    To be honest, I don’t understand how they remove this pattern from the IR stream. Probably some kind of inter-frame interpolation, but it is not noticeable at all.
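    Intel does not document the camera's internals, but the basic principle behind any structured-light sensor is triangulation: a projected pattern feature shifts in the image in inverse proportion to the distance of the surface it lands on. A minimal sketch of that relationship (the focal length, baseline, and disparity values below are purely illustrative, not RealSense specifications):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Triangulate depth from the observed shift of a projected pattern.

    focal_px     -- focal length in pixels (illustrative value)
    baseline_m   -- projector-to-camera baseline in meters (illustrative)
    disparity_px -- observed pattern shift in pixels
    """
    if disparity_px <= 0:
        raise ValueError("pattern feature not matched")
    return focal_px * baseline_m / disparity_px

# Example: 600 px focal length, 7 cm baseline, 30 px pattern shift
# -> roughly 1.4 m to the surface
print(round(depth_from_disparity(600, 0.07, 30), 2))
```

    The same formula also shows why such sensors are meant for short range: depth resolution falls off quadratically with distance, which fits the camera's near-field positioning discussed below.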

    Secondly, there is very nice math for extracting fingers and facial contours. Face contours are extracted via active appearance models, but the algorithms are extended into 3D. This solution increases stability and accuracy. An interesting point: face extraction also works on ordinary 2D cameras. The accuracy is lower, of course, but in bright light it is not bad. I liked the idea. As far as I know, there was simply no stable, free, ready-to-use implementation of active appearance models before (although the 4 GB SDK download is off-putting).

    Face extraction happens in two stages. First, a face region is found by some fast algorithm (probably Haar cascades), then the face contour is fitted with an active shape model. The process is robust to facial expressions and glasses, but not to head rotations of more than 15-20 degrees.

    I also liked the finger-tracking solution. It is not perfectly stable, but the system is predictable, and fully working applications can be built on it. It is probably less accurate than Leap Motion, but it has a larger field of view. The solution is not perfect: ambiguities arise, and the hand model is sometimes fitted incorrectly. Some of the built-in gestures are recognized only every other time, while others are not recognized at all unless the system first sees an open hand. In the video below, I tried to highlight these problems.

    In my opinion, the potential is already there. If the quality of hand tracking improves by half again, such control will be comparable to a touchpad.

    Thirdly, I would like to note that Kinect and RealSense occupy different niches. Kinect is aimed at large spaces and working with people at a distance, while RealSense is for direct interaction with the system it is installed on. This largely determines the parameters of the 3D sensor.


    It is not without downsides. Since the current version is not final, I hope they will be corrected.
    The first inconvenience: the drivers are still raw. At some point, the camera of the colleague I attended the event with refused to work entirely. The only thing that helped was removing all the drivers and reinstalling them several times.

    When the video stream is initialized, an error periodically occurs: the entire application freezes for 20-30 seconds and does not start. The error reproduced on two computers.

    The second point concerns face recognition. In my opinion, a lot of information is missing:
    1) The eyes are the mirror of the soul. Why is there no explicit gaze direction? There is the face orientation, and there is pupil position tracking (from which, in theory, the gaze direction can be derived). But at head rotation angles above 5 degrees, the pupil position starts being approximated by the center of the eye, and this is not flagged in any way. I would like gaze direction to be exposed explicitly in the API.
    2) Face extraction works only in two modes: FACE_MODE_COLOR_PLUS_DEPTH and FACE_MODE_COLOR. Why not FACE_MODE_IR_PLUS_DEPTH, or at least FACE_MODE_IR? In low light the extraction stops working. Why can’t one use a mode in which the face is always clearly and stably visible? Many people like to sit at a computer in a dark room.
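    To illustrate the kind of gaze estimate I have in mind, here is a very rough sketch combining the head yaw the SDK already reports with the pupil's horizontal offset inside the eye. The linear model, the ±30 degree eye-rotation range, and the pixel values are all my own assumptions for illustration, not the RealSense API:

```python
def estimate_gaze_yaw(head_yaw_deg, pupil_x, eye_left_x, eye_right_x,
                      max_eye_yaw_deg=30.0):
    """Rough gaze yaw: head yaw plus the pupil's horizontal offset
    inside the eye, scaled to an assumed +/-30 degree range.
    All landmark inputs are in image pixels (hypothetical values)."""
    eye_center = (eye_left_x + eye_right_x) / 2.0
    half_width = (eye_right_x - eye_left_x) / 2.0
    # -1.0 (pupil at the left eye corner) .. +1.0 (at the right corner)
    offset = max(-1.0, min(1.0, (pupil_x - eye_center) / half_width))
    return head_yaw_deg + offset * max_eye_yaw_deg

# Head turned 10 degrees, pupil halfway toward the right eye corner:
print(estimate_gaze_yaw(10.0, 115.0, 100.0, 120.0))  # 25.0
```

    This is exactly the computation that breaks down when, past 5 degrees of head rotation, the reported pupil position silently collapses to the eye center: the second term becomes zero and the estimate degrades to plain head orientation.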

    The third point is the architecture. Maybe we missed something, but we could not run face recognition and hand recognition simultaneously. Each of these subsystems has to be initialized separately.

    The fourth minus: not all the declared frame rates work. You can, of course, request a 640×480 video stream at 300 fps from the camera. But it is neither delivered nor recorded at that rate. I would like the actually working modes to be listed.

    The fifth minus is somewhat specific to the field we often work in: biometrics. If the laser wavelength were 800-900 nm, rather than the 600-700 nm used in the camera, far more biometric features of a person would be visible, and a recognition system could be built directly on this camera.

    About the hackathon and our victory

    The hackathon began after an hour and a half of lectures and examples. There were about 40 people in 13 teams. The format: "here is a camera, show a project in 6 hours." Given that any video analytics is quite complex, that is not much time. On the other hand, it would be strange to hold such an event in any other format (in total the whole thing lasted about 8 hours, and by the end the participants were exhausted).

    Some teams had prepared for the hackathon thoroughly. Some took projects made for other purposes and heavily reworked them. All of the preparation Vasya and I did took about three hours: we sat, drank tea, and thought about what could be done. There were several ideas:

    1. Take one of our existing projects and bolt the 3D camera onto it. The problem was that all our projects were built specifically for 2D recognition and analytics, and adapting the camera to them would clearly take more than 4 hours. Besides, it was unclear how to present the result nicely.
    2. Show a demo effect: some simple, classic application, such as controlling a mouse or an airplane with the eyes or a hand, or painting whiskers on faces. The disadvantage of this option is that it is rather empty and uninteresting: polishing it takes a long time, and a simple version would not interest the audience.
    3. Show a brand-new idea, achievable only with this product. Clearly, such a thing cannot be finished in the 4 hours of a hackathon, but a demonstration effect can be shown. I liked this option the most. The main problem here is coming up with such an idea.

    One of the things I enjoy is analyzing the state of a person. Perhaps you have read one of my past articles on Habr about it. Naturally, my thoughts went in the same direction. Motion magnification through 3D is ineffective, but through 3D you can capture many characteristics of a person sitting in front of the camera. The question was how and where to apply them.

    The answer turned out to be surprisingly trivial and obvious, as soon as Vasya asked me: "What could a 3D camera help with in a car?" And then the ideas poured out. After all, a 3D camera in a car can:
    • Keep track of whether the driver falls asleep or not.
    • Monitor the driver’s attention, for example, checking that the driver looks in the rear-view mirrors when changing lanes.
    • Recognize the driver’s gestures: the driver can keep his eyes on the road while panning the map or controlling the music.
    • Automatically adjust mirrors before driving.

    And the most amazing thing: no one has done this yet. There are fatigue-detection systems, and in the last couple of years they have even started using video cameras. But with a 3D camera you can determine the position of the driver's head much more accurately, which makes it possible to monitor his reactions. Besides, detecting falling asleep is one thing; analyzing actions and assisting the driver is quite another.
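    As a sketch of what such fatigue monitoring could look like on top of the camera's eye landmarks, here is a PERCLOS-style check: the fraction of recent frames in which the eyes were mostly closed. The window size, the closure threshold, and the alarm level are illustrative assumptions, not values from any real system:

```python
from collections import deque

class DrowsinessMonitor:
    """PERCLOS-style drowsiness check: raise an alarm when the eyes
    were closed in too large a fraction of recent frames.
    All thresholds below are illustrative."""

    def __init__(self, window=900, closed_thresh=0.2, perclos_alarm=0.15):
        self.openness = deque(maxlen=window)  # 900 frames ~ 30 s @ 30 fps
        self.closed_thresh = closed_thresh    # openness below this = closed
        self.perclos_alarm = perclos_alarm    # alarm if >15% of frames closed

    def update(self, eye_openness):
        """eye_openness: 0.0 (fully closed) .. 1.0 (fully open),
        e.g. derived from eyelid landmark distances."""
        self.openness.append(eye_openness)
        closed = sum(1 for o in self.openness if o < self.closed_thresh)
        return closed / len(self.openness) > self.perclos_alarm

monitor = DrowsinessMonitor(window=10)
for o in [0.9] * 8 + [0.1, 0.05]:  # last two frames: eyes nearly shut
    alarm = monitor.update(o)
print(alarm)  # True: 2 of 10 frames closed exceeds the 15% threshold
```

    The 3D camera's contribution here would be robustness: head pose from depth lets you tell a lowered head or a glance at the mirrors apart from actual eye closure, which a plain 2D camera struggles with.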
    In four hours we put together a simple demonstration:

    For this idea and demonstration we were, unexpectedly, awarded first place.

    If anyone needs the sketches we produced in those 6 hours, here they are. EmguCV is hooked up, and the code is spaghetti. The idea is great, but how to approach a problem of this scale and level of integration is unclear. Still, such technologies may well become a stepping stone toward robotic cars.
