Pedestrian detection

    Pedestrian detection is mainly used in research on unmanned vehicles. The general goal of pedestrian detection is to prevent a vehicle from colliding with a person. There was recently a topic on Habr about "smart cars", and building such systems is a very active area of research (see the DARPA Challenge). I work on pedestrian recognition for a similar smart-car project. Strictly speaking, pedestrian detection is the software side of the problem, while collision avoidance is the hardware side. In this article I will touch only on the software part: I will briefly describe one way of detecting people in an image and a classification algorithm.

    Introduction


    In my work I use two sensors: an infrared camera and a lidar. A person's body temperature is usually higher than that of the surroundings, so a person is easy to localize in an infrared image. As a rule, the parts of the body not covered by clothing, the head and the hands, are the easiest to detect. With the camera alone, however, it is hard to estimate the size of an object or to tell how far away the person is. This is where the lidar comes to the rescue: it measures the distance to objects.

    Why do we need the lidar? Let's look at our images first. The whole idea of the image preprocessing comes down to localizing areas of interest: we do not care about the entire image, we want to select a few areas and work only with them. Ideally, an area of interest should cover the whole person. Knowing that the human head is warmer than its surroundings, we can easily find it in the image. Next we need to estimate the size of the person, and this is where the lidar data comes to the rescue. Knowing the distance to the object, the focal length of the camera and the size of the object in real-world coordinates, it is easy to calculate the size of the object in pixels. We set the real-world size of the object to a 2 by 1 meter rectangle, on the assumption that an average person fits into such a rectangle. In image coordinates, however, the areas of interest still come out with different sizes, so one more scale transformation is applied, and in the end all areas of interest not only cover the same area of the real world but also have the same pixel dimensions. A sketch of this calculation is shown below.
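    The author's pipeline runs in Matlab; purely as an illustration, here is a minimal Python sketch of this size calculation under a pinhole-camera assumption. The function names, the focal length variable and the placement of the box around the head are my own choices, not the author's code.

```python
import cv2  # used only for the final resize


def person_size_in_pixels(distance_m, focal_length_px,
                          person_width_m=1.0, person_height_m=2.0):
    """Pinhole projection: size_px = focal_length_px * size_m / distance_m.

    The 2 x 1 meter rectangle is the assumed real-world bounding box of a person.
    """
    width_px = focal_length_px * person_width_m / distance_m
    height_px = focal_length_px * person_height_m / distance_m
    return int(round(width_px)), int(round(height_px))


def extract_normalized_roi(ir_image, head_x, head_y, distance_m,
                           focal_length_px, out_size=(32, 64)):
    """Cut a person-sized window around the detected head and rescale it,
    so that every area of interest ends up with the same pixel dimensions."""
    w_px, h_px = person_size_in_pixels(distance_m, focal_length_px)
    x0 = max(int(head_x - w_px / 2), 0)    # head roughly centered horizontally
    y0 = max(int(head_y - 0.1 * h_px), 0)  # head near the top of the box
    roi = ir_image[y0:y0 + h_px, x0:x0 + w_px]
    return cv2.resize(roi, out_size)       # common scale for the later steps
```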

    Now consider how the data from the two sensors are combined: we find a hot region in the image (we assume it is a person's head), compute the angle at which the center of this region lies, convert this angle into the lidar coordinate system and read off the distance to the object at that angle. To translate an angle from one coordinate system to the other, the sensors must be calibrated. Instead of an actual calibration, a specific sensor arrangement is used in which the centers of the sensors coincide in the horizontal plane, as shown in the figure below.
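    With the sensor centers aligned as above, the fusion reduces to a nearest-beam lookup in a planar lidar scan. A minimal sketch, assuming the scan arrives as parallel arrays of beam angles and ranges and that the camera's horizontal field of view is known; the names and numbers below are illustrative assumptions.

```python
import numpy as np


def pixel_column_to_angle(x_px, image_width_px, horizontal_fov_deg):
    """Horizontal angle of an image column relative to the optical axis,
    using a simple linear (equiangular) approximation of the camera."""
    return (x_px - image_width_px / 2.0) / image_width_px * horizontal_fov_deg


def range_at_angle(angle_deg, scan_angles_deg, scan_ranges_m):
    """Return the lidar range whose beam angle is closest to the requested one."""
    idx = int(np.argmin(np.abs(np.asarray(scan_angles_deg) - angle_deg)))
    return scan_ranges_m[idx]


# Example: a hot region centered at column 400 of a 640-pixel-wide frame,
# camera horizontal field of view of 60 degrees (illustrative numbers).
angle = pixel_column_to_angle(400, 640, 60.0)
# distance_m = range_at_angle(angle, scan_angles_deg, scan_ranges_m)
```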


    Of course, on the test vehicle things are a little different. First, the figure above shows a static sensor arrangement: the positions of the sensors do not change over time. Second, our test vehicle uses a different type of lidar, a three-dimensional one, installed in the middle of the car roof, while the camera is mounted at the front of the roof. The centers of the sensors can therefore no longer be considered to lie at one point. I see two options for solving this problem: a parallel transfer (translation) of the data from the coordinate system of one sensor into that of the other (having measured the distance between the sensors beforehand), or an (automatic) calibration of the sensors.
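    A sketch of the first option, the parallel transfer of points between the sensor frames; the offset vector is a placeholder that would have to be measured on the actual vehicle, and a pure translation only works if the sensor axes are (assumed to be) parallel.

```python
import numpy as np

# Position of the camera origin expressed in the lidar frame
# (x forward, y left, z up). The numbers are placeholders.
CAMERA_OFFSET_IN_LIDAR = np.array([1.2, 0.0, -0.4])


def lidar_to_camera(points_lidar):
    """Parallel transfer of lidar points into the camera frame.

    Only a translation is applied: the axes of the two sensors are assumed
    to be parallel. A full calibration would also estimate a rotation."""
    return np.asarray(points_lidar) - CAMERA_OFFSET_IN_LIDAR
```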

    Extracting areas of interest



    Extracting the features used for recognition and classifying them takes a lot of time. Processing a single frame with 6-7 objects in Matlab can take a whole minute, and for a system oriented towards real time such processing times are unacceptable. The speed is strongly affected by the number of detected warm objects, and a person is far from the only warm object in a scene: parts of cars, windows and traffic lights can also stand out against the general temperature background. In this work the emphasis is therefore on processing speed: we need to quickly discard as many objects as possible that are definitely not people, while ideally not missing a single real person. All remaining objects can then be passed to a full-fledged classifier.

    Hot areas in the image are detected with a method called Maximally Stable Extremal Regions (MSER) [1]. The original image is processed by a threshold function with a varying threshold value. The result is a sequence of new images whose length corresponds to the number of distinct threshold values (for example, for a monochrome image with pixel values from 0 to 255 we get 256 images). The first image in the sequence is completely white, black areas appear as the threshold grows, and the last image in the sequence is completely black. The figure below shows this sequence as an animation:
    Threshold function image sequence
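    The author works in Matlab; the following Python fragment is only a sketch of the threshold sweep described above (in practice one would rather reach for a ready-made MSER implementation, such as the one shipped with OpenCV).

```python
import numpy as np


def threshold_sequence(gray, step=1):
    """Threshold an 8-bit image at every level from 0 to 255.

    Returns a boolean stack of shape (num_thresholds, H, W); slice t is True
    where the pixel value is >= the t-th threshold, so the first slice is
    all white and later slices turn progressively black."""
    thresholds = np.arange(0, 256, step, dtype=np.uint8)
    return gray[None, :, :] >= thresholds[:, None, None]
```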

    White areas in these images are extremal regions. We can analyze for how long a particular extremal region persists in the sequence of images. To do this we can apply one more threshold, for example with a value of 10: if an extremal region is present in more than 10 images of the sequence, it is called a maximally stable extremal region.
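    Continuing the sketch above, here is a crude stand-in for the stability check: count for how many threshold levels each pixel stays white and keep the connected regions that survive at least the chosen number of levels. A real MSER implementation tracks the stability of whole regions rather than individual pixels, so this is only an approximation of the idea.

```python
import cv2
import numpy as np


def stable_regions(stack, min_lifetime=10):
    """Return bounding boxes (x, y, w, h) of regions that stay white for at
    least min_lifetime threshold levels in the boolean stack."""
    lifetime = stack.sum(axis=0)                          # per-pixel persistence
    mask = (lifetime >= min_lifetime).astype(np.uint8) * 255
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    # stats[i] = (x, y, width, height, area); row 0 is the background
    return [tuple(int(v) for v in stats[i][:4]) for i in range(1, n)]
```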

    Having found the maximally stable extremal regions, we can filter them a little further: check the aspect ratio, discard objects that are too far from the camera, and handle overlapping areas. A sketch of such filtering is given below.
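    A possible shape of this extra filtering step; all thresholds below are illustrative guesses, and each area of interest is assumed to carry its bounding box in pixels together with the lidar distance.

```python
def box_iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return inter / float(aw * ah + bw * bh - inter)


def filter_areas_of_interest(areas, max_distance_m=30.0,
                             min_aspect=1.5, max_aspect=3.0,
                             overlap_threshold=0.5):
    """Keep only areas that could plausibly contain a standing person.

    Each area is a dict with 'bbox' = (x, y, w, h) and 'distance' in meters."""
    kept = []
    # Larger areas first, so a big area suppresses the small ones it overlaps.
    for area in sorted(areas, key=lambda a: a['bbox'][2] * a['bbox'][3],
                       reverse=True):
        x, y, w, h = area['bbox']
        if area['distance'] > max_distance_m:
            continue                                  # too far away
        if not (min_aspect <= h / float(w) <= max_aspect):
            continue                                  # a person is taller than wide
        if any(box_iou(area['bbox'], k['bbox']) > overlap_threshold for k in kept):
            continue                                  # heavily overlaps a kept area
        kept.append(area)
    return kept
```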

    Source image

    Maximally stable extremal regions

    Areas of interest

    Filtered Areas of Interest


    Dispersedness



    As the metric for classifying objects, "dispersedness" is used [2]. Computing it takes little time and, moreover, its value is invariant to lighting conditions. It is computed as D = P² / A, where P is the perimeter of the object's contour and A is its area. In the original work the metric is calculated from the contour of the object; to obtain a contour from an area of interest, a Gaussian filter and the Sobel operator are applied in sequence. The decision on whether an image belongs to a particular class is made with a threshold function: images of people have lower dispersedness values than images of parts of cars or buildings. A sketch of this computation is given below.
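    A sketch of the metric in Python/OpenCV; the smoothing kernel, the edge threshold and the decision threshold are all illustrative values rather than the ones used by the author.

```python
import cv2
import numpy as np


def object_contour(roi_gray, blur_ksize=(5, 5), edge_threshold=50.0):
    """Gaussian smoothing followed by the Sobel operator gives an edge map;
    the largest connected contour in it is taken as the object's contour."""
    smoothed = cv2.GaussianBlur(roi_gray, blur_ksize, 0)
    gx = cv2.Sobel(smoothed, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(smoothed, cv2.CV_32F, 0, 1)
    edges = (cv2.magnitude(gx, gy) > edge_threshold).astype(np.uint8)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    return max(contours, key=cv2.contourArea) if contours else None


def dispersedness(contour):
    """Dispersedness = perimeter^2 / area of the contour, as in [2]."""
    perimeter = cv2.arcLength(contour, True)
    area = cv2.contourArea(contour)
    return perimeter ** 2 / area if area > 0 else float('inf')


# Decision by a simple threshold (the value itself would be tuned on data):
# is_person = dispersedness(object_contour(roi)) < 60.0
```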



    Conclusion



    The results of the algorithm in pictures:


    The test computer is equipped with an Intel Core 2 Duo processor at 3 GHz, a 6 MB cache and 2 GB of RAM. The tests were run in Matlab. The average processing time for one frame is 64 ms, which means the system can process approximately 16 frames per second. That is certainly better than one frame per minute.

    Two questions naturally arise: how reliable is dispersedness as a classification metric, and how much will the per-frame processing time grow once a full-fledged classifier is used? I do not have answers to these questions yet; this is what I am working on right now. As soon as there are results, I will report them!

    Literature


    [1] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” in British Machine Vision Conference, 2002, pp. 384–396.
    [2] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, “Moving Target Classification and Tracking from Real-Time Video,” in Proc. Fourth IEEE Workshop on Applications of Computer Vision (WACV '98), October 1998, pp. 8-14.
