How it works: CAPTCHA

For as many years as Habr has existed, posts about yet another captcha have regularly appeared on it: a script for generating a picture, a new idea for a captcha with cats, and the like. In the latest example, the author does not quite understand how a captcha is supposed to work (see the post text and recent comments), yet shares his mistakes with the community. One gets the feeling that captcha is terra incognita for most developers, both for those who simply bolt it onto the next form hoping that it will work “out of the box”, and for those who invent captchas like the one where you have to pick the photo of a cat from several pictures.

This article contains useful information for those who run a captcha on their own server instead of trusting a third-party service like reCaptcha.

As a teaser: if you think that a captcha check like this one (an example from real-world code) will work:
if ($_POST['captcha'] == $_SESSION['captcha']) return true;
then you are deeply mistaken.


Captcha


By definition, CAPTCHA is a Completely Automated Public Turing test to tell Computers and Humans Apart (a test that a person can pass, but a computer cannot). In this article I consider the properties of a captcha using its most common form, text in a picture, as the example, although almost everything written here applies equally to any type of captcha.

Two main properties of captcha


Any captcha should have two properties, without which it will not work:

Resistance to recognition: a property that protects the captcha from being solved by an algorithm, for example a text recognition system. It ensures that a person can read the text in the picture, but a computer cannot.
An example: the standard phpBB 2.x forum captcha did not have this property. Because it was relatively easy to recognize, scripts appeared that spammed every forum in sight, forcing webmasters to switch to a more robust captcha.

Resistance to guessing: a property that prevents the captcha value from being guessed in a small number of attempts (fewer than 1000). If the set of possible captcha values is small, a program can simply try values one by one instead of recognizing anything.
An example: an arithmetic captcha like “1 + 2” (trying the numbers from 1 to 20 will quickly produce a hit).
An example: picking, out of several pictures, the one that shows a cat.
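To put numbers on the guessing argument, here is a minimal sketch in Python. It assumes that every attempt receives a fresh captcha whose answer is uniformly distributed over the possible values; the function name and the sample captchas are illustrative, not taken from any real library.

```python
def guess_success_probability(num_values: int, attempts: int) -> float:
    """Chance that blind guessing passes at least once.

    Assumes every attempt gets a fresh captcha whose answer is drawn
    uniformly from num_values possibilities.
    """
    if num_values < 1 or attempts < 0:
        raise ValueError("num_values must be >= 1 and attempts >= 0")
    return 1.0 - (1.0 - 1.0 / num_values) ** attempts


if __name__ == "__main__":
    # Arithmetic captcha with answers 1..20, "pick the cat" out of 4
    # pictures, and a 6-character code over a 26-letter alphabet.
    samples = [("sum 1..20", 20), ("1 of 4 pictures", 4), ("6 letters", 26 ** 6)]
    for label, n in samples:
        print(f"{label}: {guess_success_probability(n, 1000):.6f}")
```

With 1000 attempts the first two captchas are passed almost surely, while the 6-letter code stays far below a one-in-a-thousand chance.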

Captcha check


The value used for verification must be stored on the server and not sent to the browser along with the picture. To match the visitor's answer against the correct value, use some key that is sent along with the captcha (a session identifier, a captcha number, etc.).
Example: if you send both the captcha and the value for checking it (even in encrypted form), it is enough for a person to recognize one such captcha once and then reuse the pair “answer” + “value for verification” in a script (the post linked at the beginning is exactly such a case).
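A minimal sketch of this rule in Python, assuming a hypothetical in-memory dictionary as the server-side store (a real site would use the session or a shared cache). The names `issue_captcha` and `expected_answer` are illustrative only.

```python
import secrets
from typing import Dict, Optional

# Hypothetical in-memory store: captcha_id -> expected answer.
_pending: Dict[str, str] = {}


def issue_captcha(answer: str) -> str:
    """Remember the answer on the server and return an opaque id.

    Only the id (and the rendered picture) is sent to the browser.
    """
    captcha_id = secrets.token_urlsafe(16)
    _pending[captcha_id] = answer
    return captcha_id


def expected_answer(captcha_id: str) -> Optional[str]:
    """Look up the stored answer; None if the id is unknown."""
    return _pending.get(captcha_id)
```

The id is random and carries no information about the answer, so learning one “id + answer” pair does not help with any other captcha.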

Before checking the answer, make sure that it is not empty. Otherwise an attacker can, by not loading the picture or by dropping the current session identifier, submit an empty value and pass the captcha, because two empty values are compared (in PHP, a nonexistent value is null, and under the loose == comparison null equals an empty string).
Anti-example: the code I already mentioned: if ($_POST['captcha'] == $_SESSION['captcha']) return true;
Moreover, this code was written by an experienced programmer.
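The same check with the empty case handled might look like the following Python sketch. The function name is hypothetical; the use of `hmac.compare_digest` for a constant-time comparison is an extra precaution not mentioned in the article.

```python
import hmac


def captcha_matches(submitted, expected) -> bool:
    """Compare answers, treating any missing or empty value as a failure."""
    if not isinstance(submitted, str) or not isinstance(expected, str):
        return False
    if not submitted or not expected:
        return False
    # compare_digest takes time independent of where the inputs differ.
    return hmac.compare_digest(submitted.encode("utf-8"), expected.encode("utf-8"))
```

Here a request with no answer and no stored value fails, instead of passing as “empty equals empty”.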

After the check, the stored captcha value must be deleted. If this is not done, an attacker can reuse that value an unlimited number of times. True, the captcha is regenerated whenever the page with the form is reloaded (either when the form is generated or when the picture is generated), but a script does not have to load the form again (note that this is not relevant if the site uses one-time CSRF tokens for forms).
Anti-example: a hypothetical login form where it is enough to enter the captcha correctly once and then brute-force the password with a script, avoiding captcha regeneration on the server.
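One simple way to guarantee one-time use is to remove the stored value in the same step that reads it. A minimal Python sketch, assuming the session is a plain dictionary and the answer lives under the hypothetical key "captcha":

```python
def consume_and_check(session: dict, submitted: str) -> bool:
    """Check the answer and invalidate the stored value in one step."""
    # pop() removes the value whether or not the answer is right,
    # so every stored captcha can be tried exactly once.
    expected = session.pop("captcha", None)
    return bool(expected) and bool(submitted) and submitted == expected
```

Because the value is gone after the first attempt, replaying a previously correct answer fails until a new captcha has been generated.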

Bulletproof captcha


Brute-force protection. If your captcha is resistant to recognition but not very resistant to guessing (for example, it only requires reading 3-4 digits), it is advisable to limit the number of incorrect answers “from one IP” / “for one login” / etc. Such limits must be checked BEFORE checking the captcha itself (that is, even a correctly entered captcha must not count as passed while the limit is in effect), otherwise they will not hinder the brute-force attack.
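The order of the two checks is the important part. A minimal Python sketch, assuming a hypothetical in-memory failure counter and an arbitrary threshold of 5; a real implementation would also expire the counters after some time.

```python
from collections import defaultdict

MAX_FAILURES = 5  # assumed threshold, tune per site

# Hypothetical in-memory counter: key (IP or login) -> failed attempts.
_failures = defaultdict(int)


def check_with_limit(key: str, submitted: str, expected: str) -> bool:
    """Apply the failure limit first, then look at the answer."""
    if _failures[key] >= MAX_FAILURES:
        # Blocked: even a correct answer is rejected.
        return False
    if submitted and expected and submitted == expected:
        return True
    _failures[key] += 1
    return False
```

If the limit were checked only after a wrong answer, a script could keep guessing until it hit the right value and would never notice the block.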

DoS protection. When generating a captcha on your own server, keep in mind that this is a convenient vector for DoS attacks (which, unlike DDoS, any schoolkid can mount). For protection, you can limit the number of captchas generated per IP, cache captchas, etc.
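A per-IP cap on captcha generation can be sketched in Python as a sliding window of timestamps. The window length, the cap, and the in-memory storage are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0   # assumed window length
MAX_PER_WINDOW = 10     # assumed cap on pictures per IP per window

_recent = defaultdict(deque)  # ip -> timestamps of recent generations


def may_generate(ip, now=None):
    """Return True if this IP may get a freshly rendered captcha."""
    now = time.monotonic() if now is None else now
    stamps = _recent[ip]
    # Drop timestamps that have left the window.
    while stamps and now - stamps[0] >= WINDOW_SECONDS:
        stamps.popleft()
    if len(stamps) >= MAX_PER_WINDOW:
        return False
    stamps.append(now)
    return True
```

When the function returns False, the server can serve a cached picture or an error instead of spending CPU time on rendering a new one.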

Protection against recognition. If you are choosing a captcha, or are suddenly going to write one yourself, it helps to understand which captchas are better protected from recognition. There are ready-made universal captcha recognition scripts that work on the OCR principle, and if your site is of interest to spammers, there is a risk that they will use or write a script specifically for your captcha. The latter, admittedly, applies more to sites of the Yandex or vk level, but it is advisable to protect at least against ordinary OCR.

Protection against Antigate-style services (services where hired people solve captchas on behalf of a script). Formally speaking, a captcha as a Turing test is not required to protect you from such services, since in this case it is a person who recognizes it. From a practical point of view, the issue is very relevant and you do have to defend against it somehow.
There is no “gold standard” and there cannot be one (because the solving services would simply implement support for it), so you are free to supplement the captcha with any tricks that make solving it through such a service impossible. For example:
- a non-standard captcha (assembling a puzzle, rotating an image, clicking on an area of a photo, etc.);
- a Cyrillic captcha: the simplest solution, but with several disadvantages: it is suitable only for projects with a Russian-speaking audience, and some solving services support the Cyrillic alphabet;
- a virtual keyboard next to the captcha for entering non-standard characters or figures (it may be inconvenient for mobile users).

Usability


Do not ask for a captcha if you are already convinced that there is a person in front of you. Here, however, you must take care that a script cannot use the form an unlimited number of times after a person has entered the captcha once.
Example: a registration form. If I am registering somewhere and forgot to fill in the “zip code” field but entered the captcha correctly, there is no need to show me a new one. Spend 10 minutes on storing, on your side, the fact that a live person is currently trying to fill out this particular form.
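One way to remember a solved captcha without opening the form to scripts is to grant a small, expiring budget of resubmissions. A minimal Python sketch, assuming the session is a plain dictionary; the key format, the lifetime, and the number of allowed resubmissions are hypothetical.

```python
import time

PASS_TTL_SECONDS = 600   # assumed: how long a solved captcha stays valid
PASS_MAX_SUBMITS = 3     # assumed: resubmissions allowed without a new captcha


def mark_passed(session, form_id, now=None):
    """Record that a human solved the captcha for this form."""
    now = time.time() if now is None else now
    session["captcha_pass:" + form_id] = {
        "until": now + PASS_TTL_SECONDS,
        "left": PASS_MAX_SUBMITS,
    }


def may_skip_captcha(session, form_id, now=None):
    """Spend one allowed resubmission; False means show a captcha again."""
    now = time.time() if now is None else now
    key = "captcha_pass:" + form_id
    grant = session.get(key)
    if not grant or now >= grant["until"] or grant["left"] <= 0:
        session.pop(key, None)
        return False
    grant["left"] -= 1
    return True
```

The grant is tied to one form and runs out after a few submissions, so a person fixing a forgotten field is not bothered, while a script gets only a handful of free requests.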

To facilitate recognition by a person: do not use letters and numbers in the captcha at the same time, do not use both uppercase and lowercase letters, exclude similar characters.
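These rules translate directly into the alphabet from which the captcha text is drawn. A minimal Python sketch; which characters count as look-alikes depends on the font, so the alphabet below is an assumption rather than a recommendation.

```python
import secrets

# Lowercase letters only, no digits, with look-alikes removed.
# Which characters are confusable is an assumption and depends on the font.
ALPHABET = "abcdefhkmnprstuvwxyz"


def random_captcha_text(length: int = 6) -> str:
    """Draw captcha text from the restricted alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Even this reduced alphabet of 20 letters gives 20 ** 6 = 64,000,000 possible six-character values, so readability does not have to cost resistance to guessing.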

Refusal to use captcha


The best captcha is no captcha at all. Wherever you can do without it, you should. You may need to implement additional limits and checks, but users will thank you.
But here you have to be very careful. For example: a registration form without a captcha, with an email field to which an activation letter is sent. Without additional protection, such a form can be flooded with bogus addresses, and your site will end up on the blacklists of email providers. In this case you can do without a captcha, but only if you have another line of defense, such as a per-IP limit.

To some, the information in this post will seem obvious, but if I had not run into real-life examples of these simple principles being misunderstood, including by experienced fellow developers, I would not have spent time writing this text.
