War, Peace, and ABBYY Compreno: A Continuation of Our Affair with Tolstoy

    Recently we described here how the “All Tolstoy in One Click” project was carried out. With the help of 3,249 (three thousand two hundred forty-nine) volunteers and 1 (one) good OCR technology, we digitized 46,820 pages of the writer’s 90-volume collected works, proofread them carefully, and made them openly available.

    But if you thought our “romance with Tolstoy” ended there, you were mistaken. Having digitized the writer’s texts, we began studying them with ABBYY Compreno information extraction technology; it would be a shame to let such rich material go to waste. Read on to find out what “text mining Tolstoy” gave us and where the results are used now.


    The main goal of the “All Tolstoy in One Click” project was to make Tolstoy’s work truly public, so that every text that came from his pen would be accessible in one click from anywhere in the world. That, incidentally, is what the author himself willed: during his lifetime he renounced all rights to his texts (yes, anon, Leo Tolstoy knew about copyleft and openness long before this Internet of yours and Richard Stallman).

    However, the ability to load a book in a convenient format onto an e-reader or tablet is not the only benefit of digitization. Tolstoy’s texts can now be not only read but also “measured,” that is, studied with various quantitative methods, using the whole arsenal of automatic text processing tools (AOT in Russian usage, better known as NLP). After all, once you have all of a writer’s texts in electronic form, even one or two well-chosen search queries can yield interesting data that in earlier times would have cost a literary scholar weeks or months of hard work. And if you also have an advanced natural language analysis technology, you have a real chance of making a serious philological discovery (even without being a philologist). Below I will describe what we managed to measure and learn, but first a few words about who does automatic processing of literary texts, how and why they do it, and what makes it interesting.

    Lyrical digression: Distant Reading and Computational Philology

    In 2010 Google counted 130 million books in the world, with the caveat that the figure held “at least until Sunday.” By now there are probably several million more. That in itself is not a problem: it is clear anyway that reading “everything about everything” is a bad idea, unless you are a 12-year-old devouring an encyclopedia a week. What is worse is that past a certain point even the list of books on one narrow topic or, say, in one literary movement becomes unmanageable. Victorian England alone, for example, produced more than 60,000 works of fiction. There is hardly a scholar, even among those who deliberately study the literature of that era, who has read so much as one percent of that collection.

    One possible (if controversial) solution to this problem was among the first to be proposed by Franco Moretti, a provocative critic and former neo-Marxist who now heads the Stanford Literary Lab. He declared that literary scholars today should “stop reading books and start counting, mapping, and visualizing them.” To ordinary reading (close reading) Moretti opposes distant reading, that is, automatic analysis of text corpora, statistics, graph plotting, and so on. In his view, this is the only way to make literary criticism “objective” and to avoid conclusions drawn from a “ridiculously small” sample. Research results from the Stanford Literary Lab, carried out in the spirit of distant reading, can be viewed here.

    Distant Reading with Compreno

    Researchers at Stanford mostly use the simplest statistics, such as the frequencies of words and N-grams and their distribution across the text. From the very beginning we decided to study aspects of a literary text that cannot be pulled out with a simple Ctrl+F. Take the characters’ speech activity: try counting offhand how many times Natasha Rostova (or any other character) says something. You will quickly realize that for this you need, first, to resolve pronominal anaphora automatically (for examples like “Natasha began to put on her dress. - In a moment, in a moment, don’t come in, Papa, - she shouted to her father”); second, to somehow delimit the set of verbs that can express the act of “speaking” (and they are quite diverse); and third, to have at least automatic morphology, and preferably syntax, since Russian word order is free and it is not so easy to find the speaker in examples such as “He never blessed his children and only, offering her his bristly cheek, as yet unshaven that day, said, looking her over sternly and at the same time with attentive tenderness: - Are you well? .. well, then sit down!”

    Fortunately, all of this is already built into Compreno. The syntactic-semantic trees produced by the parser contain all the necessary information about who said what and how; syntactic and lexical ambiguity has already been resolved in them, and so has pronominal anaphora. For example, in a fragment like “- Really? - exclaimed Anna Mikhailovna. - Ah, this is terrible! It’s frightening to think ... This is my son, - she added, pointing to Boris. - He wanted to thank you himself,” you need to work out who “she” is and to determine the correct semantic class of the polysemous verb “add.” Compreno copes with both tasks. This is what the subtree for “she added, pointing to Boris” looks like:

    Extracting mentions of characters and the necessary information about them from such trees is the job of our information extraction mechanism, which we have described here more than once from different angles (one, two). Because we rely on deep syntax and the semantic hierarchy, we can cover a large class of cases with one or two tree patterns. For example, a rule that looks for a structure like this:

    will fire on examples as different as:

    - Do you want to kiss me? - she whispered almost inaudibly, looking up at him from under her brows, smiling, and almost crying with excitement.
    - Denisov, don’t joke about this, - Rostov shouted. - It’s such a lofty, such a beautiful feeling, such a ...
    - Gently, gently, can’t you be gentler? - evidently suffering more than the dying soldier, the sovereign said, and rode away.
    The aunt cleared her throat, swallowed, and said in French that she was very pleased to see Helene.
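    To make the idea of a tree pattern more concrete, here is a minimal sketch in Python. It is not Compreno’s rule language or API: the node structure, the semantic class names (TO_SPEAK, TO_SHOUT and so on) and the role label are hypothetical stand-ins chosen for illustration. The point is only that one pattern, “a node whose class descends from the speaking class, with an Agent child,” matches many different verbs once the tree already carries semantic classes, roles and resolved pronouns.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical semantic hierarchy (child class -> parent class).
# Illustrative names only, not Compreno's real class inventory.
HIERARCHY = {
    "TO_WHISPER": "TO_SPEAK",
    "TO_SHOUT": "TO_SPEAK",
    "TO_SAY": "TO_SPEAK",
    "TO_SPEAK": "VERBAL_COMMUNICATION",
}


@dataclass
class Node:
    lemma: str
    sem_class: str
    role: Optional[str] = None       # deep position relative to the parent
    referent: Optional[str] = None   # set when a pronoun has been resolved
    children: list["Node"] = field(default_factory=list)


def is_a(sem_class: str, ancestor: str) -> bool:
    """Walk up the hierarchy to check whether sem_class descends from ancestor."""
    current: Optional[str] = sem_class
    while current is not None:
        if current == ancestor:
            return True
        current = HIERARCHY.get(current)
    return False


def find_speakers(node: Node) -> list[str]:
    """Return the Agent of every node in the tree whose class is a kind of TO_SPEAK."""
    found = []
    if is_a(node.sem_class, "TO_SPEAK"):
        for child in node.children:
            if child.role == "Agent":
                # Prefer the resolved referent over the surface pronoun.
                found.append(child.referent or child.lemma)
    for child in node.children:
        found.extend(find_speakers(child))
    return found


# "she shouted to her father", with "she" already resolved to Natasha
she_shouted = Node("shout", "TO_SHOUT", children=[
    Node("she", "HUMAN", role="Agent", referent="Natasha"),
    Node("father", "HUMAN", role="Addressee"),
])
# "Rostov shouted"
rostov_shouted = Node("shout", "TO_SHOUT", children=[
    Node("Rostov", "HUMAN", role="Agent"),
])

speakers = find_speakers(she_shouted) + find_speakers(rostov_shouted)
```

    In this toy version the hard parts (parsing, disambiguation, anaphora resolution) are assumed to have happened before the pattern is applied, which is exactly the work the parser does in the real system.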

    In addition to speech activity, we investigated some other aspects of the behavior of Tolstoy's heroes. Below I will talk about what we managed to find out.

    Impulsive Natasha Rostova and calm Andrei Bolkonsky: what we managed to understand using Compreno

    To begin with, we simply counted how many times each character in “War and Peace” says something and compiled a ranking of the most “talkative” characters in absolute numbers. It is unlikely to surprise anyone familiar with the novel:

    Here frequency is apparently nothing more than an indicator of a character’s “centrality.”

    If we normalize these counts by each character’s total number of mentions in the text (after first removing characters who are mentioned too rarely), the ranking changes noticeably:

    Now Petya Rostov is at the top: an emotional, talkative child in the first volume and an enthusiastic, romantic teenager in the fourth (right up to his death). Next come three female characters: Princess Marya, quiet, modest and worn down by her strict father, whom we get to know mainly through her conversations with other characters and her inner monologue; Natasha Rostova, a spontaneous and lively heroine whose lines the reader hears throughout the novel (she is 13 in the first volume and 29 in the epilogue); and Anna Drubetskaya, an active schemer who can talk anyone she needs into submission.
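    The normalization behind the second ranking can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: the function name, the cutoff min_mentions and the counts for the placeholder characters A, B and C are all made up for the example.

```python
def talkativeness_ranking(speech_counts, mention_counts, min_mentions=50):
    """Rank characters by utterances per mention, skipping rarely mentioned ones."""
    ranking = []
    for name, mentions in mention_counts.items():
        if mentions < min_mentions:
            continue  # too few mentions for the ratio to be meaningful
        ratio = speech_counts.get(name, 0) / mentions
        ranking.append((name, ratio))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Made-up counts for placeholder characters, not figures from the novel.
speech_counts = {"A": 300, "B": 40, "C": 3}
mention_counts = {"A": 1500, "B": 100, "C": 5}

ranking = talkativeness_ranking(speech_counts, mention_counts)
names_in_order = [name for name, _ in ranking]
```

    With these placeholder numbers, character A leads in absolute counts, but B ranks first after normalization, and C is dropped by the cutoff. This is the same effect that moves a secondary but talkative character above the central ones.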

    It should be said here that Tolstoy considered it important to give each character a style of speech of his own; this was part of his creative method. He even explained his well-known dislike of Shakespeare (“the works of Shakespeare, recognized by the whole world as works of genius <...> were repulsive to me”) by the claim that “Shakespeare lacks the main, if not the only, means of portraying characters, ‘language,’ that is, having each person speak in his own language, characteristic of his character.” So at the next stage we tried to identify some meaningful parameters along which the characters’ speech consistently differs.

    The first obvious parameter is the number of exclamatory and interrogative sentences. By the ratio of questions, exclamations and all other (conditionally neutral) speech, one can already understand quite a lot about the character. Compare the three young Rostovs, Andrei Bolkonsky and Pierre Bezukhov. The predictable exclamation champion is the youngest of the Rostovs, Petya:

    Natasha is older than Petya and shows a little more restraint, but she is still very emotional; only a third of her speech is conditionally “neutral”:

    Nikolai, the elder brother of Petya and Natasha, exclaims and asks questions even less, but his share of neutral speech still falls short of half; like all the Rostovs, he is very emotional:

    Prince Andrei Bolkonsky is a different matter: impeccably self-possessed and proud, he treats high society with cold contempt and shows emotion only among those close to him (it is no accident that in the Oscar-winning film adaptation he was played by the strong-willed, handsome Vyacheslav “Stirlitz” Tikhonov). Bolkonsky exclaims very little and asks relatively few questions:

    Pierre Bezukhov is perhaps the most reflective character in the novel. He is clearly more emotional than Andrei Bolkonsky, but not in the direction of “exclamations,” like the whole Rostov family. Pierre rarely exclaims, but he asks questions almost as often as the childlike, direct Petya and Natasha.
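    Once each utterance has been attributed to a speaker, the proportions discussed above are easy to compute. The sketch below is a simplified assumption of how this could be done: it classifies an utterance by its final punctuation mark only, and the function names and the sample strings are illustrative.

```python
from collections import Counter

# Characters ignored at the end of an utterance: whitespace, closing quotes,
# brackets, and trailing dots, so that "Are you well? .." still counts as a question.
TRAILING = " \t\n\"'”’»).…"


def utterance_type(utterance: str) -> str:
    """Classify an utterance as 'exclamation', 'question' or 'neutral'."""
    stripped = utterance.rstrip(TRAILING)
    if stripped.endswith("!"):
        return "exclamation"
    if stripped.endswith("?"):
        return "question"
    return "neutral"


def speech_profile(utterances: list[str]) -> dict[str, float]:
    """Return the share of each utterance type for one character."""
    kinds = ("exclamation", "question", "neutral")
    if not utterances:
        return {kind: 0.0 for kind in kinds}
    counts = Counter(utterance_type(u) for u in utterances)
    return {kind: counts[kind] / len(utterances) for kind in kinds}


# Sample strings for demonstration only (taken from different speakers).
sample = [
    "Really?",
    "Ah, this is terrible!",
    "This is my son.",
    "Do you want to kiss me?",
]
profile = speech_profile(sample)
```

    A real implementation would work on sentences rather than whole replies, since a single reply can contain both a question and an exclamation.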

    Compreno also makes it easy to extract the description Tolstoy gives of the act of speaking itself, and this too can serve as a kind of parameter. Most often this description takes the form of an adverbial participle attached to the verb of speaking (“Pierre shouted, striking the table with a resolute and drunken gesture”) or of an object in the instrumental case with the preposition “с” (“with”), as in “Prince Vasily asked, with his cheeks twitching even more than before.” For example, the speech of the rich, important and self-serving Prince Vasily Kuragin is accompanied more often than that of other characters by adverbial participles that describe either his appearance (“rubbing his bald head,” “straightening his frill”) or his hidden intentions, character traits and movements of the soul (“saying things he did not even want to be believed,” “angrily pulling toward himself the table that had been moved away”). Anna Mikhailovna Drubetskaya, forever fawning on the characters she needs something from, often speaks “smiling” or “with a smile.” For the phlegmatic, constantly sleepy Kutuzov, speaking is often accompanied by a movement of the head: he either nods it or lowers it.

    Sensitive Marya Bolkonskaya and intrigues around Pierre’s inheritance: the deep syntax of “War and Peace”

    In our next micro-study we decided not to limit ourselves to speech activity and to look at every situation in which a character is “active” in the text. To do this, we collected statistics on the deep positions that characters occupy under various predicates. Deep positions in Compreno trees are similar to semantic roles. For example, if a character performs an active action (speaks, walks, shoots, hits), he fills the Agent position; if he appears as the passive object of an external action (he is scolded, driven, beaten, praised, loved), he fills the Object position; if he perceives, sees, hears, feels or, say, loves something, he becomes the Experiencer; and if he is the recipient of a message (“she told Pierre”), he fills the Addressee position. There are other positions as well (our model has about 500 of them), but here we use only a few of the most common ones that can appear under a predicate.

    It is important that deep positions reflect the semantic role of a participant in the situation and do not depend directly on how it is expressed in a particular sentence. Thus in both “Pierre loved Natasha” and “Natasha was loved by Pierre,” Pierre is the Experiencer and Natasha is the Object, regardless of voice.

    It turned out that statistics on deep positions reveal something about the differences between the characters’ personalities and give “objective” confirmation to the impressions a reader forms while reading the novel. Let’s look at a chart showing the shares of the selected deep positions for the main characters in the first volume of “War and Peace”:

    In general, the frequency distributions look similar, and quite predictably the most frequent position for all the heroes turned out to be agent. However, the spread here is quite large - from 40.7% for Princess Marya and 44.6% for Boris Drubetskoy to 68.3% for Anna Drubetskaya. These three "extreme" characters are of interest.
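    Percentages like these come from a simple count of (character, deep position) pairs. Here is a hedged sketch of that aggregation; the function name, the input format and the observations for the placeholder character X are assumptions made for illustration, and the real figures above come from Compreno’s analysis of the novel.

```python
from collections import Counter, defaultdict


def position_shares(occurrences):
    """Turn (character, deep_position) pairs into per-character percentages."""
    per_character = defaultdict(Counter)
    for character, position in occurrences:
        per_character[character][position] += 1

    shares = {}
    for character, counts in per_character.items():
        total = sum(counts.values())
        shares[character] = {
            position: 100 * count / total for position, count in counts.items()
        }
    return shares


# Made-up observations for placeholder characters X and Y.
occurrences = [
    ("X", "Agent"), ("X", "Agent"), ("X", "Experiencer"), ("X", "Object"),
    ("Y", "Agent"), ("Y", "Addressee"),
]
shares = position_shares(occurrences)
```

    Comparing shares rather than raw counts is what makes a rarely mentioned character comparable with a central one.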

    Princess Marya stands out above all for her abnormally high frequency in the Experiencer position. Combined with her low Agent frequency, this gives us a portrait of a character who is sensitive but not very active, which is entirely accurate for the first volume. Andrei Bolkonsky’s sister lives “without a break” on the Bald Hills estate with her father, an old general of Catherine’s era who is pedantic and strict to the point of tyranny. She spends her time corresponding with her brother and her friend Julie, talking with pilgrims, and studying algebra and geometry, lessons that the old prince arranges for her. She enters the reader’s field of view only in connection with other characters’ visits to Bald Hills. Literary scholars believe that Tolstoy created the image of Princess Marya under the strong influence of 18th-century sentimentalism.

    Anna Drubetskaya’s lead in the share of Agent uses is also easily explained by the plot of the first volume. This middle-aged lady of noble name but very modest means launches a flurry of activity at the beginning of the novel, the ultimate goal of which is the well-being and advancement of her only son, Boris. She is described as “one of those women, especially mothers, who, once they have taken something into their heads, will not let go until their wish is fulfilled, and otherwise are ready for daily, minute-by-minute pestering, and even for scenes.” First Anna Mikhailovna besieges the wealthy and influential Prince Vasily, seeking her son’s transfer to the Guards; then she successfully schemes against him over Count Bezukhov’s inheritance, all the while extracting money from the Rostovs “to outfit Boris.”

    Boris himself has not yet become as cynical, dexterous and greedy as his mother; that will happen in the later volumes. He does not want to swallow his pride, so he resists Anna Mikhailovna’s requests to be “sweet,” “affectionate” and “attentive” during visits to important people and takes part in her efforts with extreme reluctance, acting as a passive object. Boris’s passivity shows up in our chart as a large share of the Object deep position.

    Natasha’s “thick neck” in your smartphone: we revive War and Peace

    Attempts to “count” literature often draw criticism along the lines that the authors are trying to measure the immeasurable and are thereby vulgarizing and emasculating the classic’s imperishable work. Interestingly, the same accusations were being made 100 years ago, when distant reading was not even on the horizon. “It was believed that to study the work itself was to anatomize it, and for that, as everyone knows, you must first kill a living creature. We were constantly accused of this crime,” wrote Boris Eikhenbaum, one of the leading representatives of the formal method in literary studies, in 1921 (the formalists of that time were something like people who invented distant reading in theory long before the invention of the computer and had no way of trying it in practice).

    So that we would not be accused of “killing” the novel, we decided to do the exact opposite and revive it. To this end, together with colleagues from the Higher School of Economics, we joined the development of the Samsung Live Pages mobile application, which now uses the results of the information extraction system based on ABBYY Compreno.

    The “Live Pages” application implements several non-standard scenarios for exploring works of fiction and their characters: timelines of events and fates, character cards and quotes, and interactive maps that tie places to episodes of the novel.

    All of this is built on infographics done in a game-like style and, it seems to us, has a better chance of hooking a gadget-loving tenth grader with ADD than the thick volume the school librarian will hand him.

    In addition to the characters’ lines for the quotation book, Compreno was used to extract dates for the timelines, locations for the maps, and also epithets, the various characteristics that Tolstoy loved to bestow on his characters. Everyone, of course, remembers the faint mustache of the little princess, Bolkonsky’s wife, but how many people have noticed that the brilliant, handsome Andrei had “small, plump hands” (and was short besides), and that the graceful, slender Natasha Rostova had a “thick neck” and a “big mouth”?

    Everyone can download the application and make many more discoveries in the same spirit. Meanwhile, we will return to our studies and continue to “anatomize” the texts with the help of Compreno, look for new unexpected things in them and reveal the mysterious “Tolstoy code” that made his works immortal.
