Licenzero: porn detective
We have a great job: we get paid to watch pornographic videos. But seriously, we work in the R&D department of Inventos, which builds systems for automatic filtering of web content: moderation, copyright protection, and so on. Our task was to build a system that detects pornographic content automatically. Here we describe how we approached it.
General classification approach
After reviewing the existing approaches to detecting pornography in video, we decided to tackle the problem comprehensively, that is, to rely on several different indicators of pornography. The video is passed through several detectors, each of which returns an estimate of how “pornographic” the video is, with varying accuracy. The resulting estimates are then combined into a single final score.
We decided not to evaluate an entire movie at once, but to look for short fragments. The fragment length was chosen based on the accuracy of the final classification.
Using several detectors lets us combine them, add new ones, and work on each one separately. To date, the system consists of four detectors:
- the nature of the movement (its rhythm);
- color (the number of skin-colored pixels in the frame);
- the content of the frame (characteristic shapes in the picture);
- sound (the presence of moans).
Each of these detectors returns the probability that the fragment is pornographic. All that remains is to compute the overall probability.
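The article does not say how the four estimates are merged. One common option, shown here purely as an assumption, is to treat the detectors as independent and add up their log-odds (a naive Bayes style fusion). The function name `combine_probabilities`, the `prior` parameter, and the example scores are hypothetical.

```python
import math


def combine_probabilities(probs, prior=0.5, eps=1e-6):
    """Fuse per-detector probabilities by summing log-odds.

    Assumes the detectors are conditionally independent (a naive Bayes
    assumption made for this sketch, not something stated in the article).
    """
    prior_logit = math.log(prior / (1.0 - prior))
    total = prior_logit
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithm finite
        total += math.log(p / (1.0 - p)) - prior_logit
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical outputs of the movement, color, frame-content and sound detectors
fragment_scores = [0.9, 0.8, 0.6, 0.7]
overall = combine_probabilities(fragment_scores)
print(f"overall probability: {overall:.3f}")
```

With this rule, two detectors that disagree symmetrically (0.9 and 0.1) cancel out to 0.5, while detectors that agree reinforce each other.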
Now a little more detail about each detector individually.
Nature of movement
We started our work with the search for rhythmic movement in the frame. But first, a few words about classification itself. The essence of classification is to divide a set of objects into classes (two, in our case). To do this, we:
- take a training set of objects that we have labeled manually;
- create a procedure for selecting the parameters of a statistical model;
- train the model on the training set;
- test the model on a separate test set to evaluate its accuracy.
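The four steps above can be sketched with a deliberately tiny model. The article does not name the statistical model it uses, so this example assumes the simplest possible one: a single threshold on a one-dimensional feature. The data is synthetic and all names are hypothetical.

```python
import random


def accuracy(samples, threshold):
    """Share of (feature, label) pairs that the threshold classifies correctly."""
    return sum((x >= threshold) == bool(y) for x, y in samples) / len(samples)


def fit_threshold(train):
    """Parameter selection: try each training value as the threshold and
    keep the one with the highest training accuracy."""
    candidates = sorted(x for x, _ in train)
    return max(candidates, key=lambda t: accuracy(train, t))


# Step 1: a manually labeled set (synthetic here: label 1 = porn, 0 = not).
rng = random.Random(0)
data = [(rng.gauss(0.8, 0.1), 1) for _ in range(200)]
data += [(rng.gauss(0.2, 0.1), 0) for _ in range(200)]
rng.shuffle(data)
train, test = data[:300], data[300:]

# Steps 2-3: select the model parameter on the training set.
threshold = fit_threshold(train)

# Step 4: estimate accuracy on objects the model has not seen.
test_accuracy = accuracy(test, threshold)
print(f"threshold={threshold:.3f}, test accuracy={test_accuracy:.2%}")
```

Keeping the test set separate from the training set is what makes the reported accuracy meaningful: a threshold tuned and scored on the same objects would look better than it really is.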
So, everything is simple. The first task was to obtain fragments with rhythmic pornography (collecting fragments without porn took no effort). We watched a number of videos, then cut out and saved the scenes with characteristic rhythmic movement. That took 60 man-hours (for classification, the more objects, the better).
The technical details of the search for rhythmic movement will be described in the following articles. Here we only note that our method is based on spatio-temporal filters.
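The spatio-temporal filters themselves are left to the follow-up article, so the sketch below does not reproduce them. It shows a much simpler stand-in for the same idea, assumed here only for illustration: measure how much the picture changes from frame to frame, then look for a repeating pattern in that signal with autocorrelation. The function names and the toy clip are hypothetical.

```python
def motion_energy(frames):
    """Mean absolute difference between consecutive frames.

    Each frame is a flat list of grayscale pixel values of equal length.
    """
    return [
        sum(abs(a - b) for a, b in zip(cur, prev)) / len(prev)
        for prev, cur in zip(frames, frames[1:])
    ]


def periodicity(signal, min_lag=2):
    """Return (score, lag) of the strongest normalized autocorrelation peak."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return 0.0, 0  # constant signal: no rhythm at all
    best_score, best_lag = 0.0, 0
    for lag in range(min_lag, n // 2 + 1):
        r = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy
        if r > best_score:
            best_score, best_lag = r, lag
    return best_score, best_lag


# Toy clip: one-pixel frames whose brightness follows a repeating pattern.
frames = [[v] for v in [0, 10, 10, 0] * 8]
score, lag = periodicity(motion_energy(frames))
print(f"rhythm score={score:.2f} at a lag of {lag} frames")
```

A score close to 1 means the motion repeats almost exactly every `lag` frames. A real detector would have to work on much noisier signals than this toy clip.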
Color
Color is easier to deal with than movement. Each pixel in the picture has coordinates in some color space. We simply determine where a pixel with those coordinates occurs more often: in images of a naked human body or in other areas of the picture. From this data we derive a measure of how much of the video fragment is filled with naked bodies. We will not go into the specific implementation here either, and will only say a few words about the color space used. We settled on the YUV color model because:
- there are only two color coordinates (U and V);
- by discarding the brightness coordinate (Y), we can ignore differences in the brightness of objects;
- no additional conversion is needed when working with video.
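The implementation is not described here, so the following is only a minimal sketch of the idea under stated assumptions: pixels arrive as 8-bit `(Y, U, V)` tuples, the UV plane is quantized into a 32x32 grid, and the skin and non-skin training sets are of comparable size so raw counts can be compared directly. All names and sample pixels are hypothetical.

```python
from collections import Counter

BINS = 32  # assumption: 8-bit U and V values quantized into a 32x32 grid


def uv_bin(pixel):
    """Map a (Y, U, V) pixel to its UV grid cell, ignoring brightness Y."""
    _, u, v = pixel
    return (u * BINS // 256, v * BINS // 256)


def build_histogram(pixels):
    """Count how many training pixels fall into each UV cell."""
    return Counter(uv_bin(p) for p in pixels)


def skin_probability(pixel, skin_hist, other_hist):
    """Share of training pixels in this UV cell that were labeled as skin."""
    cell = uv_bin(pixel)
    skin, other = skin_hist[cell], other_hist[cell]
    return skin / (skin + other) if skin + other else 0.5  # unseen cell


def skin_fraction(frame_pixels, skin_hist, other_hist, threshold=0.5):
    """Share of frame pixels that look more like skin than like background."""
    hits = sum(
        skin_probability(p, skin_hist, other_hist) > threshold
        for p in frame_pixels
    )
    return hits / len(frame_pixels)


# Tiny hand-made training sets; real ones would come from labeled images.
skin_hist = build_histogram([(120, 110, 150), (200, 108, 148)])
other_hist = build_histogram([(90, 128, 128), (50, 140, 100)])

frame = [(30, 109, 149), (220, 128, 128), (100, 110, 150), (10, 0, 0)]
print(skin_fraction(frame, skin_hist, other_hist))
```

Note that the dark pixel `(30, 109, 149)` and the brighter `(100, 110, 150)` land in the same cell: dropping Y is what makes the lookup insensitive to brightness.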
Frame content
When looking for pornography, you cannot ignore individual frames: there is useful information in them too. To extract it directly from the frames, we decided to use Bag of Visual Words. First, the “visual words” are determined: fragments or samples that best characterize frames with and without porn. Together they form a vocabulary of visual words. Then, during classification, our detector estimates how pornographic a frame is from the presence of particular words in the picture.
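The core step of Bag of Visual Words can be sketched as follows. This is only an illustration: it assumes the vocabulary has already been built (typically by clustering local descriptors) and that each frame has already been turned into a list of descriptor vectors. The two-dimensional toy descriptors and all names are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(vocabulary)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(descriptor, vocabulary[i])),
    )


def bovw_histogram(descriptors, vocabulary):
    """Normalized counts of how often each visual word occurs in a frame."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts) or 1  # avoid dividing by zero for an empty frame
    return [c / total for c in counts]


# Toy vocabulary of three "visual words" and four descriptors from one frame.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frame_descriptors = [(0.1, 0.0), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]
print(bovw_histogram(frame_descriptors, vocabulary))  # [0.25, 0.25, 0.5]
```

The resulting histogram is a fixed-length description of the frame, which can then be fed to any ordinary classifier trained on frames with and without porn.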
Sound
The sound detector relies on two main features that help us recognize pornography:
- The presence of a human (mainly female) voice.
- Rhythmic repetition of specific sounds. To detect it, we compute mel-frequency cepstral coefficients.
This lets us judge, with some probability, whether moans are present in the audio fragment. The detector classifies the fragment based on these two features.
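A full MFCC computation also involves framing the signal, a Fourier transform, a logarithm and a discrete cosine transform, and is normally taken from an audio library. The sketch below covers only the "mel" part, using the widely used formula mel = 2595 * log10(1 + f / 700). The function names, the filter count and the frequency range are assumptions, not details from the article.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of n_filters bands spaced evenly in mels."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]


# Hypothetical setup: 10 bands between 100 Hz and 4000 Hz.
centers = mel_center_frequencies(10, 100.0, 4000.0)
print([round(c) for c in centers])
```

The bands are packed densely at low frequencies and spread out at high ones, which roughly matches how human hearing resolves pitch and is one reason these coefficients are popular for voice-like sounds.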
Conclusion
Is that all? Of course not. This is just an introduction. We decided not to pile the technical details of all the detectors into one article, but to describe them in separate ones. The detectors are fundamentally different and the work on them was carried out separately, so the amount of work (and hence the length of the description) differs from one detector to the next.
So, to be continued:
Licenzero: simple movements
Licenzero: looking for porn by skin color