Breaking captcha

    While wandering around the Internet, I ended up on a very popular, long-established Runet site. To download a file from it, you have to solve this captcha:
    image
    Seeing yet another picture with digits, I made up my mind. The idea of breaking some captcha had been in my head for a long time :)

    I set myself a task: write a script that reads the displayed captcha and spits out the precious digits.

    I am deliberately not naming the site; you will guess it :)

    So, let's go!



    Analyzing the picture


    First you need to look at as many captchas as possible to spot similarities, differences and patterns. For this purpose I downloaded about 50 captchas. Among them you can pick out the main ones, which differ from each other the most:

    image    image    image    image    image

    In general, I like to peer at numbers, because I devoted a lot of time to studying mathematics :)

    Looking closely, we find:
    • the picture is black and white, in GIF format
    • the size of the picture may vary, but the digits are always centered horizontally (vertically they are only roughly centered)
    • a gradient is used, and it can run in one of 2 directions
    • in addition to the gradient there is an "angular gradient" (my own name for it, don't kick me :) ): a line that starts in a corner and runs at 45 degrees (again, don't kick :) ), in other words simply a diagonal line
    • in total I identified 6 different fonts (more precisely 3, plus their 3 oblique versions)
    • the pixels of all digits are no darker than #606060, but they are not all the same color
    • the captcha contains 3-5 digits, each no taller than 14px

    Looking for a solution


    For half an hour options were spinning in my head. One thing was clear: the picture should be cropped, and since the fonts are always the same and never change, "prints" can be used. By this term I mean that the digits are already stored somewhere in a database, and we only need to compare them with the picture.

    I arrived at this plan:
    • create an array of prints
    • crop the picture on all sides, throwing away the excess
    • remove the unneeded colors, that is, the gradient and the angular gradient
    • go through all the pixels from left to right and from top to bottom, and if a pixel has a digit color (>= #606060), compare against every print in turn

    Implementation


    1. Prepare the prints.
      In total there are 6 * 10 = 60 of them, stored in an array. I made the prints from the digits in the captchas, one set per font. Each print is just an array of strings, where the letter "x" marks a pixel of the digit.

      For example, here is the digit 2 in the first font:
      image
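
      To make the idea of a print more concrete, here is a minimal sketch. The glyph shape, the "." background marker and the helper name printToOffsets are made up for illustration; the real prints are in the attachment.

```php
<?php
// Hypothetical print for one digit: each string is one pixel row,
// 'x' marks a digit pixel, '.' marks background.
// The shape is invented for this example, not copied from the site.
$printTwo = [
    '.xxx.',
    'x...x',
    '....x',
    '...x.',
    '..x..',
    '.x...',
    'xxxxx',
];

/**
 * Convert a print into a list of [dx, dy] offsets of its 'x' cells,
 * which is a convenient form for comparing against image pixels.
 */
function printToOffsets(array $print): array
{
    $offsets = [];
    foreach ($print as $dy => $row) {
        $width = strlen($row);
        for ($dx = 0; $dx < $width; $dx++) {
            if ($row[$dx] === 'x') {
                $offsets[] = [$dx, $dy];
            }
        }
    }
    return $offsets;
}

$offsets = printToOffsets($printTwo);
echo count($offsets), " digit pixels\n";
```

      Keeping prints as strings makes them easy to edit by hand, which matters when there are 60 of them.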
    2. Open the picture.
      This is done simply through imagecreatefromgif($filename);

    3. Determine the direction of the gradient
      We need to know which way the gradient runs; this is required in the following steps.
      This is easy to do: just check the color of the first pixel (0, 0)
      $color = imagecolorat($image, 0, 0) < 0x20 ? 'black' : 'white';
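
      One caveat: for palette images such as GIF, imagecolorat() returns a palette index rather than a brightness, so comparing it with 0x20 directly only works if the palette happens to be ordered by brightness. A more defensive sketch resolves the index first. The helper names grayAt and gradientStart are hypothetical, and the in-memory image only stands in for a real captcha.

```php
<?php
/**
 * Return the gray level (0..255) of a pixel. The palette index returned by
 * imagecolorat() is resolved through imagecolorsforindex() before use.
 */
function grayAt($image, int $x, int $y): int
{
    $rgb = imagecolorsforindex($image, imagecolorat($image, $x, $y));
    // The captcha is grayscale, so averaging the channels is enough here.
    return intdiv($rgb['red'] + $rgb['green'] + $rgb['blue'], 3);
}

/** Hypothetical helper: classify the gradient by the corner pixel (0, 0). */
function gradientStart($image): string
{
    return grayAt($image, 0, 0) < 0x20 ? 'black' : 'white';
}

// Usage with an in-memory palette image instead of a downloaded GIF.
// In a palette image the first allocated color fills the background.
$image = imagecreate(4, 4);
imagecolorallocate($image, 0x10, 0x10, 0x10);
echo gradientStart($image), "\n";
```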

    4. Clean the angular gradients.
      Here we need to clean out the angular gradient lines, and it is better to do this before cropping the captcha.
      This is where the direction of the gradient is needed, so that we clean from the correct side.
      Analysis shows that the color difference between pixel (1, 1) and (2, 2), and so on along the diagonal, never exceeds #202020.
      To clean means to paint over with black, because none of the digit pixels are darker than #606060.

      We get the following picture:
      image
      You can view the PHP code in the attachment (see the link below).
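
      The diagonal walk could look roughly like this. This is only a sketch under assumptions, not the code from the attachment: the image is assumed to be already converted into a grid $grid[$y][$x] of gray levels 0..255, the function name cleanDiagonal is hypothetical, and the caller picks the starting pixel and the direction according to the gradient.

```php
<?php
/**
 * Paint a diagonal streak black, starting at ($x, $y) and stepping by ($dx, $dy).
 * The walk stops at the image border or as soon as two neighbouring pixels on
 * the diagonal differ by more than $maxStep (0x20, as found in the analysis).
 */
function cleanDiagonal(array $grid, int $x, int $y, int $dx, int $dy, int $maxStep = 0x20): array
{
    $height = count($grid);
    $width  = $height > 0 ? count($grid[0]) : 0;
    $previous = null;
    while ($x >= 0 && $x < $width && $y >= 0 && $y < $height) {
        $current = $grid[$y][$x];
        if ($previous !== null && abs($current - $previous) > $maxStep) {
            break; // the color jumped too much, so the streak has ended
        }
        $grid[$y][$x] = 0; // black is darker than any digit pixel (>= 0x60)
        $previous = $current;
        $x += $dx;
        $y += $dy;
    }
    return $grid;
}

// Usage: a 5x5 gray grid with a fading streak from (1, 1) to (3, 3).
$grid = array_fill(0, 5, array_fill(0, 5, 0x80));
$grid[1][1] = 0xF0;
$grid[2][2] = 0xE0;
$grid[3][3] = 0xD0;
$grid[4][4] = 0x60; // differs from 0xD0 by more than 0x20, so the walk stops here
$cleaned = cleanDiagonal($grid, 1, 1, 1, 1);
```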

    5. Crop the captcha
      At this stage we cut 12px off the left and right.
      Since the digits are no taller than 14px, we also cut the excess off the top and bottom, depending on the height of the whole captcha.

      We get:
      image
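
      A cropping sketch on the same kind of gray grid follows. The 12px side margin comes from the text above. The 20-row band kept around the vertical middle is my assumption (14px digits plus some slack, since they are only roughly centered vertically), and cropGrid is a hypothetical name.

```php
<?php
/**
 * Crop a grid of pixel rows: drop $side columns on the left and on the right
 * and keep a band of $bandHeight rows around the vertical middle.
 */
function cropGrid(array $grid, int $side = 12, int $bandHeight = 20): array
{
    $height = count($grid);
    $width  = $height > 0 ? count($grid[0]) : 0;
    $keepWidth  = max(0, $width - 2 * $side);
    $bandHeight = min($bandHeight, $height);
    $top = intdiv($height - $bandHeight, 2);

    $cropped = [];
    foreach (array_slice($grid, $top, $bandHeight) as $row) {
        $cropped[] = array_slice($row, $side, $keepWidth);
    }
    return $cropped;
}

// Usage: a 60x30 picture becomes 36x20.
$grid = array_fill(0, 30, array_fill(0, 60, 0));
$grid[5][12] = 0x90; // ends up in the top-left corner after cropping
$cropped = cropGrid($grid);
echo count($cropped[0]), 'x', count($cropped), "\n";
```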
    6. Clean the gradient
      Extra gradient strips still remain on all sides. They must be cleaned as well.
      We pass first from top to bottom, then from left to right, and look at the color of each strip: if it is solid (longer than 10px) and of a single color, we treat it as a gradient strip and clean it.

      The result:
      image
      In some cases (~5%) noise like this can still remain:
      image    image
      It will not get in our way, though :) because its color no longer matches the color of the digits.
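
      One way to read this step, again on a gray grid $grid[$y][$x]: any run of more than 10 equal, non-black pixels in a row or a column is blanked. The names cleanRuns and cleanGradientStrips are hypothetical and the attachment may well do it differently.

```php
<?php
/** Blank every run of more than $minRun equal, non-black values in one line of pixels. */
function cleanRuns(array $line, int $minRun = 10): array
{
    $n = count($line);
    $start = 0;
    while ($start < $n) {
        $end = $start;
        while ($end + 1 < $n && $line[$end + 1] === $line[$start]) {
            $end++;
        }
        if ($line[$start] !== 0 && $end - $start + 1 > $minRun) {
            for ($i = $start; $i <= $end; $i++) {
                $line[$i] = 0;
            }
        }
        $start = $end + 1;
    }
    return $line;
}

/** Apply cleanRuns() to every row (top to bottom), then to every column (left to right). */
function cleanGradientStrips(array $grid, int $minRun = 10): array
{
    foreach ($grid as $y => $row) {
        $grid[$y] = cleanRuns($row, $minRun);
    }
    $width = count($grid) > 0 ? count($grid[0]) : 0;
    for ($x = 0; $x < $width; $x++) {
        $column = cleanRuns(array_column($grid, $x), $minRun);
        foreach ($column as $y => $value) {
            $grid[$y][$x] = $value;
        }
    }
    return $grid;
}

// Usage: a 3x12 grid whose middle column is one solid vertical strip.
$grid = [];
for ($y = 0; $y < 12; $y++) {
    $grid[] = [0x60 + $y, 0x90, 0x70 + $y];
}
$cleaned = cleanGradientStrips($grid);
```

      Digit pixels survive because, as noted in the analysis, they are not all of one color, so they rarely form such long uniform runs.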

    7. Compare with the prints
      We go through all the pixels from top to bottom and from left to right; for each pixel whose color matches the digit color, we compare against all the prints in turn.

    Results
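
      The comparison can be sketched as simple template matching. Several things here are assumptions: the grid holds gray levels, the two tiny 3x3 prints are invented, the sketch tries every position instead of only digit-colored pixels, and it demands an exact match (background cells of the print must not be digit-colored), which the real CapCrack class may not do.

```php
<?php
/** True if $print, placed with its top-left corner at ($left, $top), matches the grid exactly. */
function printMatchesAt(array $grid, array $print, int $left, int $top, int $minGray = 0x60): bool
{
    foreach ($print as $dy => $row) {
        for ($dx = 0, $w = strlen($row); $dx < $w; $dx++) {
            $value = $grid[$top + $dy][$left + $dx] ?? -1; // -1: outside the image
            $isDigitPixel = $value >= $minGray;
            if ($isDigitPixel !== ($row[$dx] === 'x')) {
                return false;
            }
        }
    }
    return true;
}

/** Scan columns left to right (rows top to bottom inside each) and collect matched digits. */
function recognize(array $grid, array $prints): string
{
    $height = count($grid);
    $width  = $height > 0 ? count($grid[0]) : 0;
    $result = '';
    for ($x = 0; $x < $width; $x++) {
        for ($y = 0; $y < $height; $y++) {
            foreach ($prints as $print) {
                if (printMatchesAt($grid, $print['rows'], $x, $y)) {
                    $result .= $print['digit'];
                    $x += strlen($print['rows'][0]) - 1; // skip the matched glyph
                    continue 3; // move on to the next column
                }
            }
        }
    }
    return $result;
}

/** Test helper: build a gray grid from text rows, 'x' becomes 0x90, anything else 0. */
function gridFromRows(array $rows): array
{
    $grid = [];
    foreach ($rows as $row) {
        $line = [];
        foreach (str_split($row) as $ch) {
            $line[] = $ch === 'x' ? 0x90 : 0;
        }
        $grid[] = $line;
    }
    return $grid;
}

// Hypothetical prints; with several fonts the list simply gets longer.
$prints = [
    ['digit' => '1', 'rows' => ['.x.', '.x.', '.x.']],
    ['digit' => '7', 'rows' => ['xxx', '..x', '..x']],
];
$grid = gridFromRows([
    '.........',
    '.xxx..x..',
    '...x..x..',
    '...x..x..',
    '.........',
]);
echo recognize($grid, $prints), "\n"; // prints "71"
```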


    image

    Testing


    For testing I downloaded 200 such captchas; on my home PC the script worked through them in about 19 seconds.
    That is roughly 10 captchas per second.

    Not a single error was found among these 200, the script worked fine :)

    Summary


    I wrote a class, CapCrack, that parses the captcha.

    If you want to understand the algorithm in more detail, or test it on your own PC, you can take a look at the code: cap_crack.zip

    I did not stop at this success and decided to try writing a script that downloads files from the site automatically, but that is a completely different story :) worthy of a separate article ...

    PS This is my first post on Habré, so please do not judge too harshly :)
