Results of the competition to reconstruct shredded documents

    Task number 5: about 6,200 fragments, each roughly 150 x 60 px.

    DARPA has announced the results of its competition to reconstruct shredded documents. Almost 9,000 teams took part.

    Each “puzzle” consisted of fragments of handwritten text that had been shredded on a new commercial shredder and scanned at a resolution of 400 DPI. The most difficult task, number 5, contained about 6,200 fragments from an unknown number of pages, and only two teams managed to solve it.

    The winner was the team All Your Shreds Are Belong To US, which scored the maximum possible 50 points by completing all the tasks. Its closest competitors scored 30 and 26 points.

    Nobody managed to develop a fully automatic solution: every team relied on one or more human operators to check that the fragments had been matched correctly. The Polish team tried crowdsourcing. A couple of dozen users jointly solved the first puzzle relatively quickly but did not get any further.

    Programmer Mark Newlin (team wasabi), who finished third, published his document recovery methodology. All modules were developed in C# / .NET 4.0 / MSSQL. The first stage is preparation for assembly: splitting the scanned image into separate fragments, removing the background, and aligning each fragment.



    Fragment borders are detected after the background has been filled. Fragments are aligned automatically along the side with the largest number of pixels, with manual alignment for ambiguous cases (according to Mark, about 1% of fragments). The top and bottom edges of a fragment are also easy to identify from the characteristic marks left by the shredder, so a fragment can be rotated 180° if necessary. Each piece of the puzzle is saved to a file. A “cleaned” version of the fragment, trimmed along its long sides, is saved separately; it is needed to find the points where pen strokes cross the edge.
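
    The background step can be illustrated with a minimal Python sketch. It is not Mark's C# code: the magenta background color, the tolerance of 30, and the function names `foreground_mask` and `bounding_box` are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # RGB


def foreground_mask(image: List[List[Pixel]], background: Pixel,
                    tol: int = 30) -> List[List[bool]]:
    """True where a pixel differs from the background by more than tol in any channel."""
    return [
        [any(abs(c - b) > tol for c, b in zip(px, background)) for px in row]
        for row in image
    ]


def bounding_box(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom), inclusive, or None if nothing is foreground."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, on in enumerate(row) if on]
    if not rows:
        return None
    return min(cols), rows[0], max(cols), rows[-1]


# Tiny demo: a 5 x 4 scan with an assumed magenta background and a white fragment.
BG: Pixel = (255, 0, 255)
PAPER: Pixel = (255, 255, 255)
scan = [[BG] * 5 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2, 3):
        scan[y][x] = PAPER

box = bounding_box(foreground_mask(scan, BG))
```

    A real pipeline would still have to separate touching fragments and estimate the rotation angle; the sketch only shows how a filled background makes the fragment border easy to find.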

    Before assembly, a database is compiled with information about each fragment: its dimensions in “dirty” and clean form, the coordinates of the ruled lines (if the ruling of the sheet is visible on the fragment), the shape of the border, the points where pen strokes leave the fragment, the color of each point on the border, and the recognized character. Since OCR programs handle this poorly, character recognition was done manually, Mark says, with a glass of beer after every thousand fragments.
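
    As a rough illustration of such a per-fragment record, here is a sketch in Python. The original database was MSSQL and its schema has not been published in this article, so every field name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FragmentRecord:
    """One row of a hypothetical fragment database (field names are assumptions)."""
    fragment_id: int
    raw_size: Tuple[int, int]      # (width, height) of the "dirty" image, in pixels
    clean_size: Tuple[int, int]    # (width, height) after trimming the long sides
    ruled_line_ys: List[int] = field(default_factory=list)  # rows of visible ruled lines
    border_shape: List[int] = field(default_factory=list)   # edge offset for each row
    pen_exits: List[Tuple[str, int]] = field(default_factory=list)  # (side, row) pairs
    border_colors: List[Tuple[int, int, int]] = field(default_factory=list)  # RGB per border point
    symbol: Optional[str] = None   # manually recognized character, if any


# Example record with made-up values.
record = FragmentRecord(
    fragment_id=1,
    raw_size=(150, 60),
    clean_size=(148, 59),
    ruled_line_ys=[22],
    pen_exits=[("left", 14), ("right", 31)],
    symbol="e",
)
```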

    The proximity probability for each pair of fragments was calculated from three cues: the points where pen strokes touch the fragment borders (their coordinates and their number), the points where the ruled lines of the paper touch the borders, and the similarity of the fragments in color.
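
    The article does not give the actual formula, so the following Python sketch only shows one plausible way to combine the three cues. The `Edge` layout, the 2-pixel tolerance, and the weights 0.5 / 0.3 / 0.2 are all assumptions, and the result is a heuristic score between 0 and 1 rather than a calibrated probability.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Edge:
    """Features of one long edge of a fragment (hypothetical layout)."""
    pen_ys: Sequence[int]              # rows where a pen stroke crosses the edge
    line_ys: Sequence[int]             # rows where a ruled line crosses the edge
    mean_color: Tuple[int, int, int]   # average RGB color along the edge


def match_fraction(a: Sequence[int], b: Sequence[int], tol: int = 2) -> float:
    """Share of crossing points that have a partner within tol rows on the other edge."""
    if not a and not b:
        return 0.5  # nothing to compare: treat as neutral evidence
    matched = sum(1 for y in a if any(abs(y - z) <= tol for z in b))
    matched += sum(1 for z in b if any(abs(z - y) <= tol for y in a))
    return matched / (len(a) + len(b))


def color_similarity(c1: Tuple[int, int, int], c2: Tuple[int, int, int]) -> float:
    """1.0 for identical colors, 0.0 for opposite corners of the RGB cube."""
    distance = sum((x - y) ** 2 for x, y in zip(c1, c2)) ** 0.5
    return 1.0 - distance / (3 * 255 ** 2) ** 0.5


def proximity_score(right: Edge, left: Edge,
                    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of pen-stroke, ruled-line, and color agreement (assumed weights)."""
    w_pen, w_line, w_color = weights
    return (w_pen * match_fraction(right.pen_ys, left.pen_ys)
            + w_line * match_fraction(right.line_ys, left.line_ys)
            + w_color * color_similarity(right.mean_color, left.mean_color))


# Made-up edges: `good` continues the strokes of `a`, `bad` does not.
a = Edge([10, 40], [25], (240, 240, 230))
good = Edge([11, 39], [25], (238, 241, 229))
bad = Edge([20], [5], (120, 110, 100))
```

    With these made-up edges, `proximity_score(a, good)` comes out higher than `proximity_score(a, bad)`, which is the ordering an operator would want when browsing candidates.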

    Based on this information, the document is assembled manually in a graphical editor. Mark used GIMP and Paint.NET, but for the complex fourth and fifth puzzles, with thousands of fragments, he had to build a separate interface for filtering the fragments in the database by various parameters: proximity probability, pen color, presence of coffee stains, etc.
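
    A filter of this kind boils down to a parameterized database query. The sketch below uses Python's built-in `sqlite3` module instead of MSSQL, and the `candidates` table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3
from typing import List, Optional, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of candidate neighbors (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE candidates ("
        "fragment_id INTEGER, neighbor_id INTEGER, probability REAL, "
        "pen_color TEXT, has_stain INTEGER)"
    )
    conn.executemany(
        "INSERT INTO candidates VALUES (?, ?, ?, ?, ?)",
        [
            (1, 2, 0.9, "blue", 0),
            (1, 3, 0.4, "blue", 0),
            (1, 4, 0.6, "blue", 1),
            (1, 5, 0.8, "black", 0),
            (2, 1, 0.9, "blue", 0),
        ],
    )
    return conn


def find_candidates(conn: sqlite3.Connection, fragment_id: int,
                    min_probability: float = 0.5,
                    pen_color: Optional[str] = None,
                    has_stain: Optional[bool] = None) -> List[Tuple[int, float]]:
    """Return (neighbor_id, probability) pairs, best match first."""
    query = ("SELECT neighbor_id, probability FROM candidates "
             "WHERE fragment_id = ? AND probability >= ?")
    params: list = [fragment_id, min_probability]
    if pen_color is not None:          # optional filter on pen color
        query += " AND pen_color = ?"
        params.append(pen_color)
    if has_stain is not None:          # optional filter on coffee stains
        query += " AND has_stain = ?"
        params.append(int(has_stain))
    query += " ORDER BY probability DESC"
    return conn.execute(query, params).fetchall()


conn = build_demo_db()
blue_matches = find_candidates(conn, 1, pen_color="blue")
```

    Sorting by probability in the query means the operator always sees the most likely neighbors first.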



    An interface was also added to display the most suitable fragments on the screen, which increased the accuracy and speed of the assembly.



    The master document with all the matches found was gradually extended, and the probabilities were recalculated.



    Mark Newlin says he spent all his free time on the project over the past few weeks. He solved four of the five tasks in the competition, all but the most difficult fifth puzzle of 6,200 pieces, which was worth 24 points. Apparently Mark simply did not have enough time, because he worked alone. He now plans to buy a couple of commercial shredders to continue his experiments and improve his technology. Perhaps in the future Mark will write a book or open his own company to compete with Unshredder.com. He will not be alone, though: the DARPA contest has probably given rise to a large community of people interested in the topic.

    The winning team, All Your Shreds Are Belong To US, also promises to reveal its solution algorithm soon. In the comments on Mark's blog post, its members said they used many of the same methods. In the accompanying note, they reported that solving all the tasks took about 600 man-hours.

    The scans of the solutions (PDF) submitted by the winning team have been published on the DARPA website. For example, the originals and the reconstructed fragments of three pages from the fifth task are shown below. In this task all the fragments were mixed together, each page had fragments missing, and the second page was almost entirely absent. To earn points, it was necessary not only to assemble the puzzle but also to decipher the message. In the fifth task, for instance, the message was encoded in Morse code (solution of each task, PDF).

    Page 1, Morse code in the last line


    Page 2 was shredded upside down


    Page 3


    The DIN 32757 security standard for shredders specifies the maximum fragment size after shredding for each security level:

    Level 1 = 12 mm strips or 11 x 40 mm fragments
    Level 2 = 6 mm strips or 8 x 40 mm fragments
    Level 3 = 2 mm strips or 4 x 30 mm fragments (Confidential marking)
    Level 4 = 2 x 15 mm fragments (Commercially Sensitive marking)
    Level 5 = 0.8 x 12 mm fragments (Top Secret or Classified marking)
    Level 6 = 0.8 x 4 mm fragments (Top Secret or Classified marking)

    In the fifth contest task the fragment size is about 148 x 59 pixels, i.e. 9.4 x 3.7 mm at the 400 DPI scan resolution, which approximately corresponds to a level 4 shredder under the DIN 32757 security standard. According to Wikipedia, the CIA security standards for shredders call for fragments no larger than 1 x 5 mm, and those of the Russian Federation for fragments no larger than 1 x 1 mm.
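
    The conversion from pixels to millimeters follows directly from the scan resolution of 400 DPI and the definition of an inch as 25.4 mm. A short Python check (the function name `px_to_mm` is just an illustrative choice):

```python
MM_PER_INCH = 25.4


def px_to_mm(pixels: float, dpi: float = 400.0) -> float:
    """Convert a length in pixels to millimeters at the given scan resolution."""
    return pixels / dpi * MM_PER_INCH


width_mm = px_to_mm(148)   # 148 px / 400 DPI * 25.4 mm
height_mm = px_to_mm(59)   # 59 px / 400 DPI * 25.4 mm
print(f"{width_mm:.1f} x {height_mm:.1f} mm")  # prints "9.4 x 3.7 mm"
```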
