# Speeding up the Viola-Jones method

Recently the Viola-Jones method, which for a long time was the main way of detecting objects in an image, has given ground to newer and more advanced algorithms. Nevertheless, it remains relevant to this day.


Yes, the cascade classifier based on Haar features (the Viola-Jones method) is inferior in speed to the cascade LBP classifier. It is less accurate than a detector based on HOG features, and all the more so than a detector based on convolutional neural networks. And yet it occupies a certain niche: when higher accuracy than that of the LBP cascade is required, but the more accurate detectors are not fast enough. An equally important factor is that a large number of already trained cascades exist for the Haar cascade classifier, including those in the standard package of the OpenCV library. Therefore, the speed of this algorithm is very important, which prompted the author at one time to take up its optimization.

And what article about face detection would be complete without a photograph of Lena?

## Ways to Improve Performance

There are quite a few detailed descriptions of how the Viola-Jones method works, including on Habrahabr: source 1, source 2. Therefore, I will not dwell on its description in detail; those interested can read about it in the sources above. Instead, I would like to say a few words about the possible ways of accelerating the Viola-Jones method.

- The Viola-Jones method is not a rigid, fixed algorithm. Its cascades are formed by training, so its speed can vary greatly depending on the complexity and the size of the object we want to detect. The simplest way to speed it up is to set a limit on the minimum object size or to increase the step of the image pyramid. However, these measures have side effects in the form of reduced accuracy or an increased number of false positives. Retraining the cascades can sometimes give a very significant gain in both accuracy and speed, but it is a rather costly process that requires collecting a large training set, and the training itself is, to put it mildly, not a fast process.
- A fairly obvious way to accelerate is to process different parts of the image in parallel threads. This has already been implemented in OpenCV, so there is nothing left to do here.
- Use a graphics accelerator for parallel processing. There is also an implementation of this in OpenCV. The downside is that a graphics accelerator is required.
- Use the SIMD vector instructions of the processor to speed up the algorithm. Vector instructions exist in almost all modern processors, but unfortunately they differ between processors, so separate implementations are needed for each. In addition, effective use of vector instructions requires a significant reworking of the algorithm. This article is devoted to this approach.
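For illustration, the thread-level parallelism from the second point can be sketched as follows. This is a minimal stand-alone sketch, not the OpenCV implementation; the `detectRows` callback is a hypothetical stand-in for the per-row scanning loop of the cascade:

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Split the row range [top, bottom) into nearly equal horizontal stripes
// and run detectRows on each stripe in its own thread. The stripes are
// disjoint, so no synchronization of the output is needed.
void ParallelDetect(int top, int bottom, int threads,
    const std::function<void(int, int)> & detectRows)
{
    std::vector<std::thread> pool;
    int height = bottom - top;
    for (int t = 0; t < threads; ++t)
    {
        int begin = top + height * t / threads;
        int end = top + height * (t + 1) / threads;
        pool.emplace_back(detectRows, begin, end);
    }
    for (auto & thread : pool)
        thread.join();
}
```

Each stripe is processed independently, which is why this optimization composes freely with the SIMD one described below.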

## The original (scalar) version of the algorithm

If you examine the source code of this algorithm in OpenCV, you can see numerous attempts to speed it up using various SIMD instruction sets (SSE, AVX, NEON). I do not have precise numerical estimates of how successful these attempts were. However, judging by the source code, the acceleration was hardly significant: either the optimizations do not actually use vector instructions (only their scalar analogues), or the data are loaded into the vector registers element by element, which is fatal for the overall performance of the algorithm.

So, let us consider the scalar version of the cascade Haar classifier algorithm (a simplified version is given below) for a given scale:

```
void HaarDetect(const HaarCascade & hid, const Image & mask, const Rect & rect, int step, Image & dst)
{
    for (ptrdiff_t row = rect.top; row < rect.bottom; row += step)
    {
        size_t p_offset = row * hid.sum.stride / sizeof(uint32_t);
        size_t pq_offset = row * hid.sqsum.stride / sizeof(uint32_t);
        for (ptrdiff_t col = rect.left; col < rect.right; col += step)
        {
            if (mask.At<uint8_t>(col, row) == 0)
                continue;
            float norm = Norm(hid, pq_offset + col);
            if (Detect(hid, p_offset + col, 0, norm) > 0)
                dst.At(col, row) = 1;
        }
    }
}
```

Here the image is scanned over all of its points (at small scales, as a rule, the image is scanned with a step of 2 for acceleration). For each point a normalizing coefficient is calculated (it depends on the brightness and contrast of the image in the vicinity of this point):

```
uint32_t Sum(uint32_t * const ptr[4], size_t offset)
{
    return ptr[0][offset] - ptr[1][offset] - ptr[2][offset] + ptr[3][offset];
}

float Norm(const HaarCascade & hid, size_t offset)
{
    float sum = float(Sum(hid.p, offset));
    float sqsum = float(Sum(hid.pq, offset));
    float q = sqsum * hid.windowArea - sum * sum;
    return q < 0.0f ? 1.0f : sqrtf(q);
}
```
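For intuition: the quantity under the root, q = N·Σx² − (Σx)², equals N² times the variance of the brightness in the window, so the normalization coefficient is essentially the contrast (standard deviation) of the window scaled by its area. Here is a quick scalar check of what `Norm` computes, written as illustrative stand-alone code rather than library code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Compute windowArea * sum(x^2) - (sum x)^2 over a window of pixels and
// take the square root, exactly as Norm() above does. The result equals
// N * stddev(x), i.e. the window contrast scaled by the window size.
float NormOf(const std::vector<int> & window)
{
    float n = float(window.size()), sum = 0.0f, sqsum = 0.0f;
    for (int x : window)
    {
        sum += x;
        sqsum += float(x) * x;
    }
    float q = n * sqsum - sum * sum; // n^2 * variance
    return q < 0.0f ? 1.0f : std::sqrt(q);
}
```

A perfectly flat window has zero variance, so features in low-contrast regions are compared against a near-zero threshold; the `q < 0` guard protects against rounding making q slightly negative.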

And also the classification itself:

```
float WeightedSum(const WeightedRect & rect, size_t offset)
{
    int sum = rect.p0[offset] - rect.p1[offset] - rect.p2[offset] + rect.p3[offset];
    return rect.weight * sum;
}

int Detect(const HaarCascade & hid, size_t offset, int startStage, float norm)
{
    const HaarCascade::Stage * stages = hid.stages.data();
    const HaarCascade::Node * node = hid.nodes.data() + stages[startStage].first;
    const float * leaves = hid.leaves.data() + stages[startStage].first * 2;
    for (int i = startStage, n = (int)hid.stages.size(); i < n; ++i)
    {
        const HaarCascade::Stage & stage = stages[i];
        const HaarCascade::Node * end = node + stage.ntrees;
        float stageSum = 0.0;
        for (; node < end; ++node, leaves += 2)
        {
            const HaarCascade::Feature & feature = hid.features[node->featureIdx];
            float sum = WeightedSum(feature.rect[0], offset) +
                WeightedSum(feature.rect[1], offset);
            if (feature.rect[2].p0)
                sum += WeightedSum(feature.rect[2], offset);
            stageSum += leaves[sum >= node->threshold * norm];
        }
        if (stageSum < stage.threshold)
            return -i;
    }
    return 1;
}
```

Here the Haar features are calculated using pre-computed integral images.

## Ways to Vectorize the Algorithm

An analysis of the above algorithm shows that the main computing resources are spent on calculating the weighted integral sums (the **WeightedSum** function). Therefore, these must be vectorized first of all. There are only two ways to do this:

- Try to calculate several Haar features at once in one SIMD vector.
- Try to calculate several image points at once in one SIMD vector.

### Feature vectorization

At first glance it seems most optimal to vectorize the calculation of the Haar features (by the way, this is what is done in OpenCV). However, there is one catch: the data on which the calculations must be performed are randomly scattered in memory. Therefore, they must first be collected using scalar read operations, and the results of the calculation must then be scattered back using scalar write operations. You can, of course, use gather-scatter operations (they exist in AVX2 and AVX-512), but their speed is comparable to, and sometimes lower than, simply using scalar code.
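To see why feature-wise vectorization is memory-bound, here is a purely illustrative scalar model (the `Corners` struct is hypothetical, not a library type): the four corner offsets of one feature's rectangle are unrelated to those of the next feature, so filling one 4-lane SIMD vector with feature sums requires sixteen scattered reads, which is exactly what gather instructions (or scalar code) end up doing:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Corners { size_t p0, p1, p2, p3; }; // offsets into the integral image

// Gather the rectangle sums of several features into one "vector". Each sum
// needs four reads at unrelated offsets: there is no contiguous block of
// memory that a single vector load could fetch.
std::vector<int> GatherFeatureSums(const std::vector<int> & integral,
    const std::vector<Corners> & features)
{
    std::vector<int> sums;
    for (const Corners & f : features)
        sums.push_back(integral[f.p0] - integral[f.p1] - integral[f.p2] + integral[f.p3]);
    return sums;
}
```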

### Point vectorization

Vectorization over image points is less obvious (in effect, we vectorize almost the outermost loop). However, the data in the integral image for neighboring points lie next to each other, which allows them to be loaded efficiently using vector operations. In addition, the method scales easily to different SIMD vector lengths. Here the advantages end, and the list of problems begins:

- The effectiveness of the Viola-Jones algorithm rests on the fact that obviously unsuitable points are discarded at each stage. The proportion of discarded points depends on the training parameters and is 50% by default. If we process a SIMD vector holding several points, we are forced to continue evaluating stages until all of its elements are discarded, which reduces the effectiveness of the method at later stages. Fortunately, the stages at which neighboring points are discarded correlate quite strongly. Moreover, if only one point remains in the vector, we can switch to the scalar version of the algorithm, which in this case is more efficient.
- For optimization at small scales, the original algorithm from OpenCV checks points with a step of 2. This speeds up the scalar code but reduces the efficiency of vector instructions, since at the first stage half of the calculations are done for nothing. Fortunately, this problem has a rather elegant solution, described in more detail below.

## SIMD version of the algorithm

In this article I will give a simplified version of the algorithm for SSE4.1; the versions for AVX2, AVX-512 and NEON use the same approach.

To begin with, consider the case when we scan all points of the image without gaps:

```
void DetectionHaarDetect32fp(const HidHaarCascade & hid, const Image & mask, const Rect & rect, Image & dst)
{
    size_t rightAligned = rect.left + Simd::AlignLo(rect.Width(), 4);
    for (ptrdiff_t row = rect.top; row < rect.bottom; row += 1)
    {
        size_t p_offset = row * hid.sum.stride / sizeof(uint32_t);
        size_t pq_offset = row * hid.sqsum.stride / sizeof(uint32_t);
        size_t col = rect.left;
        for (; col < rightAligned; col += 4)
        {
            __m128i result = _mm_loadu_si128((__m128i*)(mask.Row<uint32_t>(row) + col));
            if (_mm_testz_si128(result, _mm_set1_epi32(1)))
                continue;
            __m128 norm = Norm32fp(hid, pq_offset + col);
            Detect32f(hid, p_offset + col, norm, result);
            _mm_storeu_si128((__m128i*)(dst.Row(row) + col), result);
        }
        for (; col < rect.right; col += 1)
        {
            if (mask.At(col, row) == 0)
                continue;
            float norm = Norm32f(hid, pq_offset + col);
            if (Detect(hid, p_offset + col, 0, norm) > 0)
                dst.At(col, row) = 1;
        }
    }
}
```

Here, in contrast to the scalar version, the image is scanned not point by point but in blocks of 4 points (only the edge points are processed in the scalar way). The normalization coefficient for a block of four points is calculated as follows:

```
inline __m128 ValidSqrt(__m128 value)
{
    return _mm_blendv_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(value), _mm_cmpgt_ps(value, _mm_set1_ps(0.0f)));
}

inline __m128i Sum32ip(uint32_t * const ptr[4], size_t offset)
{
    __m128i s0 = _mm_loadu_si128((__m128i*)(ptr[0] + offset));
    __m128i s1 = _mm_loadu_si128((__m128i*)(ptr[1] + offset));
    __m128i s2 = _mm_loadu_si128((__m128i*)(ptr[2] + offset));
    __m128i s3 = _mm_loadu_si128((__m128i*)(ptr[3] + offset));
    return _mm_sub_epi32(_mm_sub_epi32(s0, s1), _mm_sub_epi32(s2, s3));
}

inline __m128 Norm32fp(const HidHaarCascade & hid, size_t offset)
{
    __m128 area = _mm_set1_ps(hid.windowArea);
    __m128 sum = _mm_cvtepi32_ps(Sum32ip(hid.p, offset));
    __m128 sqsum = _mm_cvtepi32_ps(Sum32ip(hid.pq, offset));
    return ValidSqrt(_mm_sub_ps(_mm_mul_ps(sqsum, area), _mm_mul_ps(sum, sum)));
}
```

The classification itself for a block of four points:

```
inline int ResultCount(__m128i result)
{
    uint32_t SIMD_ALIGNED(16) buffer[4];
    _mm_store_si128((__m128i*)buffer, _mm_sad_epu8(result, _mm_setzero_si128()));
    return buffer[0] + buffer[2];
}

inline __m128 WeightedSum32f(const WeightedRect & rect, size_t offset)
{
    __m128i s0 = _mm_loadu_si128((__m128i*)(rect.p0 + offset));
    __m128i s1 = _mm_loadu_si128((__m128i*)(rect.p1 + offset));
    __m128i s2 = _mm_loadu_si128((__m128i*)(rect.p2 + offset));
    __m128i s3 = _mm_loadu_si128((__m128i*)(rect.p3 + offset));
    __m128i sum = _mm_sub_epi32(_mm_sub_epi32(s0, s1), _mm_sub_epi32(s2, s3));
    return _mm_mul_ps(_mm_cvtepi32_ps(sum), _mm_set1_ps(rect.weight));
}

inline void StageSum32f(const float * leaves, float threshold, const __m128 & sum, const __m128 & norm, __m128 & stageSum)
{
    __m128 mask = _mm_cmplt_ps(sum, _mm_mul_ps(_mm_set1_ps(threshold), norm));
    stageSum = _mm_add_ps(stageSum, _mm_blendv_ps(_mm_set1_ps(leaves[1]), _mm_set1_ps(leaves[0]), mask));
}

void Detect32f(const HidHaarCascade & hid, size_t offset, const __m128 & norm, __m128i & result)
{
    typedef HidHaarCascade Hid;
    const float * leaves = hid.leaves.data();
    const Hid::Node * node = hid.nodes.data();
    const Hid::Stage * stages = hid.stages.data();
    for (int i = 0, n = (int)hid.stages.size(); i < n; ++i)
    {
        const Hid::Stage & stage = stages[i];
        if (stage.canSkip)
            continue;
        const Hid::Node * end = node + stage.ntrees;
        __m128 stageSum = _mm_setzero_ps();
        if (stage.hasThree)
        {
            for (; node < end; ++node, leaves += 2)
            {
                const Hid::Feature & feature = hid.features[node->featureIdx];
                __m128 sum = _mm_add_ps(WeightedSum32f(feature.rect[0], offset), WeightedSum32f(feature.rect[1], offset));
                if (feature.rect[2].p0)
                    sum = _mm_add_ps(sum, WeightedSum32f(feature.rect[2], offset));
                StageSum32f(leaves, node->threshold, sum, norm, stageSum);
            }
        }
        else
        {
            for (; node < end; ++node, leaves += 2)
            {
                const Hid::Feature & feature = hid.features[node->featureIdx];
                __m128 sum = _mm_add_ps(WeightedSum32f(feature.rect[0], offset), WeightedSum32f(feature.rect[1], offset));
                StageSum32f(leaves, node->threshold, sum, norm, stageSum);
            }
        }
        result = _mm_andnot_si128(_mm_castps_si128(_mm_cmpgt_ps(_mm_set1_ps(stage.threshold), stageSum)), result);
        int resultCount = ResultCount(result);
        if (resultCount == 0) // If no candidates remain in the vector, stop the search:
        {
            return;
        }
        else if (resultCount == 1) // If only one point remains in the vector, switch to the scalar version of the algorithm:
        {
            uint32_t SIMD_ALIGNED(16) _result[4];
            float SIMD_ALIGNED(16) _norm[4];
            _mm_store_si128((__m128i*)_result, result);
            _mm_store_ps(_norm, norm);
            for (int j = 0; j < 4; ++j)
            {
                if (_result[j])
                {
                    _result[j] = Detect32f(hid, offset + j, i + 1, _norm[j]) > 0 ? 1 : 0;
                    break;
                }
            }
            result = _mm_load_si128((__m128i*)_result);
            return;
        }
    }
}
```

As you can see, the algorithm is almost the same as the scalar version (adjusted, of course, for the use of SSE vectors instead of ordinary scalar numbers).

Now consider the situation where we need to run the same algorithm with a step of 2. The simplest option is to load two vectors of 4 numbers from memory and then form a single vector from them (keeping only the even elements). However, this shows completely unsatisfactory performance. Fortunately, a small trick with reordering the original data helps: in the original integral images we move the even points to the beginning of each line and the odd ones to the end. Then loading a vector of even (or odd) elements becomes a trivial operation, and we can reuse the same code for scanning.

These optimizations can just as easily be implemented for a different vector size (8 for AVX2 and 16 for AVX-512) and for other platforms (for example, NEON for ARM).

## Optimization Results

The table below shows the image scan times (in milliseconds) for different implementations of the algorithm. A 1920×1080 image was used for testing, and scanning was performed without scaling the original image:

| Function | Scalar | SSE4.1 | AVX2 | AVX-512 |
|---|---|---|---|---|
| Common | 227.499 | 111.509 | 74.952 | 54.579 |
| DetectionHaarDetect32fi (0) | 84.792 | 45.771 | 32.716 | 25.961 |
| DetectionHaarDetect32fi (1) | 162.779 | 79.007 | 50.996 | 36.841 |
| DetectionHaarDetect32fp (0) | 329.904 | 159.355 | 109.725 | 80.862 |
| DetectionHaarDetect32fp (1) | 588.270 | 268.298 | 172.396 | 114.735 |

**DetectionHaarDetect32fp** is the function that scans the image with a step of 1, **DetectionHaarDetect32fi** with a step of 2; **Common** is the geometric mean. The numbers (0) and (1) in parentheses indicate the results for the two different cascades used in testing.

The following table shows the acceleration achieved relative to the baseline (scalar) version:

| Function | SSE4.1 / Scalar | AVX2 / Scalar | AVX-512 / Scalar |
|---|---|---|---|
| Common | 2.04 | 3.04 | 4.17 |
| DetectionHaarDetect32fi (0) | 1.85 | 2.59 | 3.27 |
| DetectionHaarDetect32fi (1) | 2.06 | 3.19 | 4.42 |
| DetectionHaarDetect32fp (0) | 2.07 | 3.01 | 4.08 |
| DetectionHaarDetect32fp (1) | 2.19 | 3.41 | 5.13 |

And finally, the relative acceleration from each doubling of the vector size:

| Function | SSE4.1 / Scalar | AVX2 / SSE4.1 | AVX-512 / AVX2 |
|---|---|---|---|
| Common | 2.04 | 1.49 | 1.37 |
| DetectionHaarDetect32fi (0) | 1.85 | 1.40 | 1.26 |
| DetectionHaarDetect32fi (1) | 2.06 | 1.55 | 1.38 |
| DetectionHaarDetect32fp (0) | 2.07 | 1.45 | 1.36 |
| DetectionHaarDetect32fp (1) | 2.19 | 1.56 | 1.50 |

The table clearly shows the diminishing returns from each subsequent doubling of the vector size.

## Software implementation

In conclusion, I would like to say a few words about the software implementation of the above algorithms. Within the framework of the Simd project, the C++ class Simd::Detection has been implemented, which hides the low-level work with the Simd library API under the hood. Among its important advantages are the compatibility of the cascades it uses (HAAR and LBP) with cascades from OpenCV, as well as ease of use. The following is an example of face detection using this framework:

```
#include <iostream>
#include <string>

#include "opencv2/opencv.hpp"

#define SIMD_OPENCV_ENABLE
#include "Simd/SimdDetection.hpp"
#include "Simd/SimdDrawing.hpp"

int main(int argc, char * argv[])
{
    if (argc < 2)
    {
        std::cout << "You have to set video source! It can be 0 for camera or video file name." << std::endl;
        return 1;
    }
    std::string source = argv[1];

    cv::VideoCapture capture;
    if (source == "0")
        capture.open(0);
    else
        capture.open(source);
    if (!capture.isOpened())
    {
        std::cout << "Can't capture '" << source << "' !" << std::endl;
        return 1;
    }

    typedef Simd::Detection Detection;
    Detection detection;
    detection.Load("../../data/cascade/haar_face_0.xml");
    bool inited = false;

    const char * WINDOW_NAME = "FaceDetection";
    cv::namedWindow(WINDOW_NAME, 1);
    for (;;)
    {
        cv::Mat frame;
        capture >> frame;

        Detection::View image = frame;
        if (!inited)
        {
            detection.Init(image.Size(), 1.2, image.Size() / 20);
            inited = true;
        }

        Detection::Objects objects;
        detection.Detect(image, objects);

        for (size_t i = 0; i < objects.size(); ++i)
            Simd::DrawRectangle(image, objects[i].rect, Simd::Pixel::Bgr24(0, 255, 255));

        cv::imshow(WINDOW_NAME, frame);
        if (cv::waitKey(1) == 27) // press 'Esc' to stop
            break;
    }
    return 0;
}
```