fuzzywuzzy and "Unseen Warfare" between humans and translator robots
In every age there have probably been literal translators and "free-spirited" translators. The latter felt completely free to change the text, rewrite it, throw out some pieces and add others. Modern researchers then face a typical question: "What exactly happened to the original during the transformation? What was cut, what was kept, what was reworked, what was added?"
Before reading the texts with full care, we wanted to hand them to robots for a preliminary study. We had little hope for the robots, but we received substantial help from them. Read below the cut how we put Google Translate and Yandex.Translate to work on the Greek text of "Unseen Warfare".
"Unseen Warfare" is an ascetic treatise, originally written in Italian, then translated into Greek in the 18th century and edited in accordance with Eastern Christian ascetic practice, and after that translated into Russian, again not literally but with significant changes. Describing the principles of his work on the Russian translation, St. Theophan the Recluse wrote:
I do not translate [this book], but freely render it in my own words ... adding, subtracting and changing against the original.
A general overview of the editorial changes is given in the work of Bishop Feoktist, but we wanted a detailed diff across the whole text.
For this, both texts (Greek and Russian) were split into paragraphs, about 700 in each.
We translated the Greek text into Russian twice: once with Yandex.Translate and once with Google Translate. We simply created large pages with the full text and opened them through the corresponding web interfaces. The translated text was almost impossible to read (apparently the original was too complex), but something could still be extracted from this horror: keywords should coincide here and there, and numbers too.
There is no great variety of tools for finding fuzzy duplicates, so we settled on fuzzywuzzy, which computes the Levenshtein distance. Of its four functions (ratio, partial_ratio, token_sort_ratio, token_set_ratio) we chose the last one, since it depends neither on word order nor on word repetitions. And, as it turned out, the choice was correct.
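To illustrate why a token-set comparison ignores word order and repetitions, here is a minimal sketch of the idea. It is a simplified re-implementation, not fuzzywuzzy's own code: the names `plain_ratio` and `token_set_ratio_sketch` are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for the Levenshtein-based ratio, so the numbers will not match the library exactly.

```python
import re
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0..100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def _tokens(text: str) -> set:
    """Lower-cased set of word tokens; duplicates and order are discarded."""
    return set(re.findall(r"\w+", text.lower()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare the shared tokens against each string's full token set."""
    t1, t2 = _tokens(s1), _tokens(s2)
    common = " ".join(sorted(t1 & t2))
    only1 = " ".join(sorted(t1 - t2))
    only2 = " ".join(sorted(t2 - t1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    return max(
        plain_ratio(common, combined1),
        plain_ratio(common, combined2),
        plain_ratio(combined1, combined2),
    )


a = "unseen warfare chapter 12"
b = "chapter 12 warfare unseen unseen"
# The token sets are equal, so the token-set score is 100,
# while the plain character-level ratio is lower.
print(plain_ratio(a, b), token_set_ratio_sketch(a, b))
```

This matters for machine-translated text, where the robot may reorder or repeat words that the human translator used only once.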
For all pairs of paragraphs (Russian vs Greek), we computed the token_set_ratio similarity of Theophan's translation with the Yandex and the Google translations. We decided to rely not on either of them separately but on their sum (à la a dual-currency basket, and this also turned out to be the right decision). Then, by eye and by hand, we reviewed the candidates with large values of this sum, as well as the candidates adjacent to already confirmed pairs.
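The "basket" ranking can be sketched roughly as follows. This is an illustration under assumptions, not the scripts actually used: the function names are hypothetical, the paragraphs are invented English toy data, and a simple Jaccard token overlap stands in for token_set_ratio so that the example runs without third-party packages.

```python
import re


def jaccard_score(a: str, b: str) -> float:
    """Placeholder similarity on a 0..100 scale (stand-in for token_set_ratio)."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return 100 * len(ta & tb) / len(union) if union else 0.0


def rank_by_summed_score(target, google_paras, yandex_paras, score=jaccard_score):
    """Return candidate indices, best first, ranked by the sum of both scores.

    google_paras[i] and yandex_paras[i] are two machine translations
    of the same source paragraph i.
    """
    totals = [
        score(target, g) + score(target, y)
        for g, y in zip(google_paras, yandex_paras)
    ]
    return sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)


# Invented toy data: one human-translated paragraph and three candidates.
target = "pray always and keep watch over the heart"
google = [
    "the sea was calm that day",
    "always pray and guard your heart",
    "numbers 12 and 40",
]
yandex = [
    "calm sea on this day",
    "pray constantly and watch the heart",
    "12 and 40 numbers",
]
print(rank_by_summed_score(target, google, yandex))
```

Summing the two scores means a candidate ranks high when either translator happens to preserve the right keywords, which is one plausible reason the basket is more robust than either score alone.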
As a result, in a few hours of work we managed to match about 2/3 of the paragraphs; of the rest, only some can be matched manually.
With the work done and the result in hand, it was interesting to go back and check which fuzzywuzzy functions and which translator suit such a task best. partial_ratio is too time-consuming (an estimated 120 hours, and we were too lazy to keep the computer busy for that long), but the remaining three functions, including token_set_ratio, were computed in about an hour for both the Yandex and the Google translations. That gives six text similarity functions in total, and the seventh is our "dual-currency basket".
Now we can look at the following tables. The first answers the question: "If we search for the Greek paragraph matching a given Russian paragraph, examining candidates in descending order of similarity (as computed by a given function), what is the probability that the right paragraph is among the first three candidates?"
[Table: probability of finding the right paragraph within three attempts, by similarity function. Surviving row label: google_set_ratio + yandex_set_ratio]
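The metric in question is a top-k hit rate. A minimal sketch of how it could be computed is given here; the function name, the score matrix and the list of correct matches are all made up for illustration and are not the article's data.

```python
def top_k_hit_rate(scores, truth, k=3):
    """Fraction of rows whose correct candidate is among the k best-scored.

    scores[i][j] is the similarity of target paragraph i to candidate j;
    truth[i] is the index of the correct candidate for paragraph i.
    """
    hits = 0
    for row, correct in zip(scores, truth):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if correct in ranked[:k]:
            hits += 1
    return hits / len(truth)


# Invented toy matrix: 3 target paragraphs, 4 candidates each.
toy_scores = [
    [90, 10, 20, 30],
    [10, 20, 30, 95],
    [50, 60, 70, 5],
]
toy_truth = [0, 3, 3]
print(top_k_hit_rate(toy_scores, toy_truth, k=3))
```

In this toy case two of the three correct candidates land in the top three, so the hit rate is 2/3.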
That is, in about 2/3 of the cases we stumble upon the correct paragraph almost immediately. In the remaining third, a lot of suffering is required. Take a look at the second table, which answers the question: "How many candidates, on average, do we have to look through before we see the right paragraph?"
[Table: average number of attempts, by similarity function. Surviving row label: google_set_ratio + yandex_set_ratio]
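The second metric is the mean rank of the correct candidate. A possible sketch follows, again with a hypothetical function name and invented toy data. Ties are counted optimistically here, which is an assumption of this sketch.

```python
def mean_rank(scores, truth):
    """Average 1-based position of the correct candidate in the ranking.

    The rank is 1 plus the number of candidates scoring strictly higher,
    so tied candidates do not push the correct one down.
    """
    ranks = []
    for row, correct in zip(scores, truth):
        ranks.append(1 + sum(1 for s in row if s > row[correct]))
    return sum(ranks) / len(ranks)


# Invented toy matrix: 3 target paragraphs, 4 candidates each.
toy_scores = [
    [90, 10, 20, 30],
    [10, 20, 30, 95],
    [50, 60, 70, 5],
]
toy_truth = [0, 3, 3]
print(mean_rank(toy_scores, toy_truth))
```

Here the ranks are 1, 1 and 4, so the mean is 2.0. A single badly ranked paragraph inflates the average noticeably, which is how a good top-three hit rate can coexist with a large average number of attempts.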
Looking through 40 or more paragraphs is a sad business, and in that case the machine does not look like a sensible prompter at all. So the optimal strategy for aligning the texts is to "skim the cream": look only at the most likely candidates, and do the rest of the matching based on structure and some other factors.
In praise of intuition
We were surprised that the "dual-currency basket" google_set_ratio + yandex_set_ratio, picked out of thin air, worked best, even better than each of these functions separately. In addition, the values in both tables show that Google Translate handles this task better than Yandex.Translate in every respect. So the domestic robots have room to grow.
PS There is nothing particularly clever in the scripts we used, but if anyone needs them, we can publish them. The comparison result itself is here.
PPS If anyone is interested, the picture in the title is a fragment of a page from the "Slavonic-Greek-Latin Primer" by Fedor Polikarpov-Orlov (1701).
PPPS Perhaps there is a scientific journal to which this text, suitably polished, could be submitted for publication?