“All Tolstoy in one click”: how we did it / ABBYY Blog

luciana August 4, 2015 at 12:37

“All Tolstoy in one click”: how we did it

Some time ago, we organized the digitization of the 90-volume collected works of Leo Tolstoy, more than 3 thousand volunteers helped us with this. There were many publications on this crowdsourcing project , but not one of them dealt with the technical part - it will be discussed in this article.

So, we were faced with the task of translating into the formats of electronic books (ePub, fb2, html, mobi), as well as in PDF with a text layer, the most complete collected works of Tolstoy. It was produced for 30 years: from 1928 to 1958, each volume came out with a circulation of 5 thousand copies. Prior to the release of the electronic publication, this collected work was not reprinted and has already become difficult to access rarities. The 90-volume includes: works of art (1–45 volumes), diaries and notebooks (46–58 volumes), letters (59–90 volumes). There was also a secret 91st volume, which consisted entirely of indexes and therefore gave our editors many ~~sleepless nights~~ to show professionalism. Of course, many creations of the classic existed in electronic form before, but not all.

Digitizing anything nowadays is not a problem when the right technology is at hand, but subtracting such large volumes of text and correcting any inaccuracies in recognition is a huge job that requires either an unlimited resource of time ~~(about eternity)~~ or a lot of helpers. Therefore, we, together with the main customer of the digitization project - the State Museum of Leo Tolstoy - decided to do a crowdsourcing project and involve volunteers in the proofreading. For convenience, a website was created - www.readingtolstoy.ru .

The collected works were scanned by the Russian State Library in 2006, and we got PDF files (only images, without a text layer) for work, one volume (and this is from 400 to 600 pages) - one file. Files together occupied only 4 GB.

Since the volunteers had to verify the texts, we decided to divide the files into small parts (“packages”) - so that the work does not seem complicated and time-consuming for people, so that it is interesting and not boring. It seemed to us that a package of 20 pages fully satisfies these conditions. So, all PDF files were automatically “cut” into parts using ABBYY Recognition Server, from each volume about 20 files came out - depending on the initial number of pages, the format remained as PDF. We were not guided by the division of volumes by any conditions other than the number of pages — for example, the end of one work and the beginning of another could fall into one package.

Further, the resulting packages had to be recognized - this was done by our employee using ABBYY FineReader(version 11 was used). Typically, document recognition consists of several stages. First, you scan a document (or open a finished scan in the program, as was the case in our case), then the program analyzes the document and marks out areas - images (they are not recognized, i.e. the text is not extracted from them), text, tables, footnotes. Then the program recognizes everything that it should recognize, then we have the opportunity to check whether everything turned out correctly (compare the scan with the recognition result).

So, our employee “ran” scans through FineReader and worked with marking areas (volunteers had to verify the recognition). Here the most ~~difficult~~interesting. We needed to analyze all volumes and decide what needs to be recognized - as a text, table, footnote or footer, and what - to leave an image - and correct the markup in accordance with this. We decided that we would leave the images as covers, actually pictures, formulas, handwritten notes and Tolstoy's drawings.

The cover of one of the volumes (standard layout of FineReader: the “image” area is highlighted in red, the “text” is green, the “table” is purple)

Tolstoy's handwritten notes

In some works, for example, in the ABC, there were a lot of pictures and very few text - we decided that most of the content on the pages would be left with images. So I automatically marked the areas of FineReader:

And it was so convenient for us:

In the output of some volumes, some surnames are surrounded by frames - such places were also marked as images. For further work with texts, it was convenient for page numbers to be marked with a footer area. In one of Tolstoy's volumes, excerpts from The Tale of Bygone Years and other works in Old Russian are given. FineReader does not recognize this language, so we initially prepared a table where such fragments are defined as images.

Marked in this way and recognized pages were saved in the native format of the FineReader document (or package). Such a document represents a folder containing a bunch of files. So that the volunteers could download the package in one file from the site, the document was archived in zip. When the packages were ready, they were posted on a specially created project site, from where the volunteers could download them for verification. Briefly about how the site was made, those interested can read under the spoiler.

Crowdsourcing platform

Crowdsourcing platform

It was necessary to create a platform for joint work of a large number of people (volunteers) in a very short time - we had only about a month to develop the platform itself.

The platform was written in Ruby in conjunction with the MySQL DBMS; the BitBucket system was used as the repository and development management. Components of the platform:

1. information part (consists of static pages about the project, news, FAQ, etc.)
2. application (manages users, books, packages and processes)
3. file storage in the source, as well as in all intermediate states fragments of books.

For reliable operation of the entire project as a whole, the Amazon cloud-based architecture with scalability was used.

According to the results of the project, such technical statistics were collected:

• peak load - 6 requests per second, on average 2-3
• peak - 9600 unique visitors in the first week of the project, 3000 on the third day (June 20)
• maximum attendance 12.00-18.00, minimum 4-6 in the morning.

The mechanics of the process looked like this: the volunteer registered on the website www.readingtolstoy.ru , went into his personal account, where he could take one package of 20 pages for verification. Packages were issued to users in the order they appear in the volume — so that entire volumes are collected faster.

All participants received a license to ABBYY FineReader 11 Professional Edition valid until the end of 2013. The program has already been configured recognition languages that are found in Tolstoy - Old Russian spelling, English, French, German, Greek, etc.

The volunteers were assigned two tasks. The first is to verify the correct layout of areas. The attentive reader will say - after all, this has already been done at the last stage. But when recognizing the correct marking of areas is about half the success, so the volunteers also had to make sure that the document was marked up correctly. The second is to check inaccurately recognized characters, compare the recognition result with the original and correct errors. There were two types of errors: incorrectly recognized characters in the text (where the quality of the scan was poor) and in the arrangement of paragraphs - paragraphs were sometimes glued together or, conversely, were broken where it was not necessary.

Also, people had to adjust the page layout - in case of transferring a word from one page to another, it was necessary to “glue” the word and leave it entirely on one of the pages. Detailed instructions were given to help the volunteers.

The package had to be checked and returned to the site within 48 hours. As we recall, the participant downloaded the archived file and in the same form should upload it back to the site. If the packet was not returned, it was sent out a second time. Points were awarded for checked packages, the most active participants received prizes - Onyx e-books, ABBYY FineReader programs and other gifts. And the main characters went on a two-day excursion to the Yasnaya Polyana museum-estate, where they could personally chat with the great-great-granddaughter of the writer Fyokla Tolstoy and other project organizers.

In truth, we didn’t think that our initiative would receive such an active response among Tolstoy’s readers, but people started registering already during a press conference on the opening of the project’s website, and the entire collected works were checked in just two weeks.

The first phase of the project attracted 1,600 participants.

When we started checking packages, the quality of work turned out to be heterogeneous. Most of the volunteers approached the matter responsibly, but there were mistakes. After checking most of the packages, the second round began - checking the same packages by the so-called “auditors”.

Auditors could be both participants in the first round, who did a good job, and new volunteers. All applicants had to undergo testing, which included questions related to the verification of texts. The auditors checked the finished packages, corrected the errors and gave the participants of the first round additional ratings, which the organizers then drew attention to.

After that, the packages arrived in a special database on the site. When all packages from one volume were ready, the project administrator saw this, downloaded all volume packages from the site and collected them back into a single document (still in FineReader format) using a special utility written by our developers. Then our employee checked to see if he had assembled correctly, whether page numbering, etc., was broken. After that, the finished volume was transferred back to the administrator.

Although the quality of the work of the auditors was beyond praise, we still wanted to play it safe and arranged a third round of text verification - this time in whole volumes. Among the volunteers, we ourselves selected 30 people who worked well in the early stages — they became “editors,” in addition, at this stage a small number of new volunteers — linguists and professional editors — joined us.

The editor could take the whole volume only, one week was given for verification, after which a person had to upload the document back to the site. If the editor did not have time to check the entire volume, he indicated the number of verified pages and uploaded the volume to the site. Volunteers worked so well on this project round that they even found factual mistakes made in the paper edition - for example, the initials of one of the editors were incorrectly indicated in the output of one of the volumes.

After the third stage of verification, the administrator exported volumes to MS Word format and they were sent for verification to our staff editors. The editors read the files again, corrections were made both in the Word-file and in the FineReader source package (to facilitate subsequent saving from it to other formats).

As a result of the project, we needed to get files of the following types:

1. PDF with a text layer
2. Html, as well as files of the FB2, epub, mobi formats for e-books (at this stage, our partners from WEXLER, who were engaged in converting, joined the work the files we received in e-book formats, for more details about this work, see the article by the head of the WEXLER software development center Sattar Gulmamedov.

Well, a little about the results. 3249 volunteers from 49 countries took part in the project. Altogether, according to the results of the work, 670 books were obtained, of which 91 are identical to the volumes of the original collected works and 579 works “extracted” from the volumes. There are 2084 files in total. For the 91st volume, only the html version was made, since this index will not be interesting in the form of an electronic book, and for 9 works they did not make the fb2 version due to some format limitations.

All e-books are available on the official portal dedicated to Tolstoy. And on the project website www.readingtolstoy.ruNow an interactive map has been posted, where everyone who has downloaded the work of Leo Tolstoy can mark himself - as a result, interesting statistics are obtained for the most popular works among users and for countries and regions with the most active readers.

Of course, the main goal of digitizing Tolstoy’s collected works is to provide access to the writer's legacy to all readers, but the benefits do not end there. The texts of Tolstoy in electronic form are of great interest to linguistic researchers. We hope to tell you about one of these studies in one of the following articles.