How does reCAPTCHA work?

    In the discussion of my recent translation of a wonderful article about CAPTCHA, several questions came up about reCAPTCHA, namely how this system works. Below I outline the essence of reCAPTCHA in general terms and show how it works and how it digitizes books.


    I’ll keep it brief but clear. The illustrations in this article were taken from the official reCAPTCHA website.

    Stop spam


    In essence, reCAPTCHA does the same job as any other captcha. The idea is simple: we type in the text shown and thereby prove that we are not a robot. The main difference from other systems is that reCAPTCHA not only protects the site from spammers but also performs another, rather interesting function.

    Read books


    As you have probably noticed, reCAPTCHA asks you to type two words, which is rarely seen in other captchas. The point is that by entering these words the user not only proves that they are human but also helps to recognize the text of old books and newspapers.

    The principle of operation is simple:
    Suppose there is some book that survives in only a few copies, all of them in poor condition. One scanned copy ends up in the hands of Google (the owner of reCAPTCHA). What should be done with it? That's right, digitize it (and the point is not only to preserve heritage, but more on that later). How? With optical character recognition (OCR). But, as many people know, these systems very often produce results riddled with errors, and checking all the text for errors by hand is far too expensive. This is where reCAPTCHA comes to the rescue. One word in the image was recognized correctly by the OCR system, while the second was not recognized at all. The second word is the one meant for the user: whatever they type will be used as a replacement for the erroneous reading proposed by OCR. Some readers will surely smirk now; yes, I know that you can in fact type anything in place of the second word. But reCAPTCHA shows each word that OCR could not read to users hundreds or even thousands of times (with about 200 million captchas generated a day, that is a very small number), and in the end the variant that users entered most often is considered correct.
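
    The scheme described above can be sketched in a few lines of Python. This is only an illustration, not Google's actual implementation: the names `submit` and `consensus`, the `min_votes` threshold, the case-insensitive comparison, and the rule that a vote is counted only when the control word is typed correctly are all assumptions made for the example.

```python
from collections import Counter


def normalize(word: str) -> str:
    """Compare answers case-insensitively and ignore stray spaces (an assumption)."""
    return word.strip().lower()


def submit(votes: dict, control_answer: str, typed_control: str,
           typed_unknown: str, unknown_id: str) -> bool:
    """Check the control word; if it matches, count a vote for the unknown word.

    Returns True when the user passes the captcha.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False  # failed the known word: treat as a bot, ignore the guess
    votes.setdefault(unknown_id, Counter())[normalize(typed_unknown)] += 1
    return True


def consensus(votes: dict, unknown_id: str, min_votes: int = 3):
    """Return the most frequently entered variant, or None if too few votes.

    min_votes is a hypothetical threshold; the article only says that each
    word is shown hundreds or thousands of times.
    """
    counter = votes.get(unknown_id)
    if not counter or sum(counter.values()) < min_votes:
        return None
    return counter.most_common(1)[0][0]
```

    Here `votes` is a plain dictionary that maps the identifier of an unreadable word to a counter of the variants users typed. A user who types nonsense instead of the second word adds one stray vote, which is outweighed once enough other users have seen the same word.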

    Enough boring text, let's move on to the illustrations:

    This is what the scanned text looks like. The quality seems quite good, but let's take a look at the OCR result:

    Errors are highlighted in red. Rather a lot of them, aren't there? Now let's see the result with reCAPTCHA:

    You don’t have to be a genius to see the difference between OCR alone and the OCR + reCAPTCHA duo. The digitized text is 100% error-free.

    Of course, this is close to an ideal situation, where everything goes as the creators of reCAPTCHA intended. But surely many of you have come across completely unreadable words offered for input. The problem is that some books and newspapers are so poorly preserved that at times they are very hard to read even for a human. Here is an example:

    The image quality is terrible. Let's see what OCR can do here ...

    ... nothing at all. Errors are not highlighted because the whole result is one big error.

    But with reCAPTCHA, the result is quite readable, albeit not flawless.

    This is how users help digitize books with reCAPTCHA. In my opinion, this is great.

    I did not understand anything!


    In short: the image generated by reCAPTCHA consists of two scanned words. One is already known to the system; the second is in doubt. It is this second word that users are asked to recognize. Roughly speaking, the reCAPTCHA interface might look like this:


    Recognition Scripts


    There is a misconception that reCAPTCHA cannot be broken (we are talking about automatic recognition of the displayed text, without human intervention). However, judging by the trends, this is not so. Over time, reCAPTCHA has added various obstacles for recognition systems: the text is warped, it is crossed with lines, and a feature was recently introduced that makes the verification word (the one known to the system) look doubled. All this suggests that reCAPTCHA is still having some difficulty with protection.

    No one suspected


    There are people who criticize reCAPTCHA, and from an ethical point of view they have good reason to. The fact is that Google makes money from the recognized text one way or another, while the text itself is obtained for free, from users. In other words, this is free labor. Personally, I do not mind; besides, no one forces users to fill in reCAPTCHA, and no one forces webmasters to install it on their sites :)

    Irony


    Surely some of you, having read the previous paragraph, realized that something is off here. Everyone knows about the manual captcha-solving services, where millions of workers in Asia type in captchas for pennies. So, taking the previous paragraph into account, it turns out that these workers are working not only for the solving service but also for Google. For free.
