Translation difficulties: how to find plagiarism translated from English in Russian scientific articles

In this first article on the Anti-Plagiarism company blog on Habr, I want to describe how our algorithm for detecting translated borrowings works. A few years ago we had the idea of building a tool that finds text in Russian documents that was translated and borrowed from English-language originals. At the same time, it was important that this tool could work against a source database of billions of texts and withstand the usual peak load of the Anti-Plagiarism service (200-300 texts per minute).


For the first 12 years of its operation, the Anti-Plagiarism service detected borrowings only within the same language: if a user uploaded a text in Russian for checking, we searched Russian-language sources; if in English, then English-language sources, and so on. In this article I will describe the algorithm we developed to detect translated plagiarism, and the cases of translated plagiarism we found by testing this solution on a corpus of Russian-language scientific articles.

Let me dot all the i's first: this article deals only with those manifestations of plagiarism that involve reusing someone else's text. Everything related to the theft of other people's inventions, ideas, and thoughts remains outside its scope. In cases where we do not know how legitimate, correct, or ethical a particular reuse was, we will say "text borrowing". We use the word "plagiarism" only when an attempt to pass off someone else's text as one's own is obvious and beyond doubt.

I worked on this article together with Rita_Kuznetsova and Oleg_Bakhteev. We decided that the images of Buratino and Pinocchio make an excellent illustration of the problem of detecting plagiarism from foreign-language sources. Let me note right away that we are in no way accusing A. N. Tolstoy of plagiarizing Carlo Collodi's ideas.

To begin with, a brief description of how the "ordinary Anti-Plagiarism" works. We built our solution on the so-called shingles algorithm, which makes it possible to quickly find borrowings in very large document collections. The algorithm splits the text of a document into small overlapping word sequences of fixed length, called shingles; shingles of 4 to 6 words are commonly used. For each shingle, a hash value is computed. The search index is a sorted list of hash values, each pointing to the identifiers of the documents in which the corresponding shingle occurs.

The document being checked is split into shingles in the same way. The index then yields the documents with the largest number of shingles matching the checked document.
This algorithm has proven itself in the search for borrowings in both English and Russian. Shingle-based search quickly finds borrowed fragments, and it detects not only verbatim copies but also borrowings with minor changes. More on the problem of detecting near-duplicate texts and methods for solving it can be found, for example, in the article by Yu. Zelenkov and I. Segalovich.
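As a rough sketch of this scheme (the shingle length, lowercasing, and choice of hash function here are simplified assumptions, not the production implementation):

```python
import hashlib

def shingles(text, size=4):
    """Split a text into overlapping word sequences (shingles)."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

def shingle_hashes(text, size=4):
    """Hash every shingle; an inverted index maps these values to document ids."""
    return {int(hashlib.md5(s.encode()).hexdigest()[:8], 16) for s in shingles(text, size)}

# Two documents sharing a long enough phrase share shingle hashes,
# so minor edits at the edges still leave plenty of matches.
a = shingle_hashes("the quick brown fox jumps over the lazy dog")
b = shingle_hashes("a quick brown fox jumps over the lazy dog today")
common = a & b  # nonempty despite the differing first and last words
```

In the real index the hash values are stored sorted together with document identifiers, so the candidate sources are simply the documents sharing the most hash values with the checked text.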

As the search engine developed, detecting "near duplicates" became insufficient. Many authors wanted to quickly raise the originality percentage of a document, or, to put it another way, to "trick" the current algorithm one way or another. The most effective way that comes to mind, naturally, is to rewrite the text in other words, that is, to paraphrase it. Its main drawback, however, is that it takes too much time. So something simpler is needed, yet guaranteed to bring results.

This is where borrowing from foreign-language sources comes to mind. The rapid progress of modern technology and the success of machine translation make it possible to produce an "original" work that at a glance looks as if it were written independently (provided one does not read carefully and hunt for machine translation errors, which, incidentally, are easy to fix).

Until recently, this type of plagiarism could be detected only with broad knowledge of the subject of the work; no automatic tool for finding borrowings of this kind existed. This is well illustrated by the case of the article "Rooter: An Algorithm for the Typical Unification of Access Points and Redundancy". In fact, this "Rooter" is a translation of the automatically generated paper "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy". The precedent was created deliberately in order to illustrate problems in the structure of journals on the Higher Attestation Commission list in particular and in the state of Russian science as a whole.

Alas, the "ordinary Anti-Plagiarism" would not have found the translated work: first, the search runs over a Russian-language collection, and second, detecting such borrowings requires a different algorithm.





General algorithm scheme


Obviously, if texts are borrowed by translation, it is mainly from English-language articles. And this happens for several reasons:

  • an incredible amount of all kinds of texts are written in English;
  • in most cases, Russian scientists use English as their second “working” language;
  • English is the generally accepted working language for most international scientific conferences and journals.

Based on this, we decided to develop a solution for finding borrowings translated from English into Russian. We arrived at the following general scheme of the algorithm:

  1. A Russian-language document to be checked arrives as input.
  2. The Russian text is machine-translated into English.
  3. Candidate sources of borrowing are retrieved from an indexed collection of English-language documents.
  4. Each candidate found is compared with the English version of the checked document, and the boundaries of the borrowed fragments are determined.
  5. The fragment boundaries are mapped back onto the Russian version of the document, and a check report is generated.
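The five steps above can be sketched with toy stand-ins (every helper function and the word-level "translation" below are hypothetical simplifications for illustration, not the actual components):

```python
# Toy stand-ins for each pipeline stage; a real system plugs in the
# actual translator, index, and alignment components.
def machine_translate(text):
    """Step 2: RU -> EN. A real system uses a trained MT model."""
    toy_dict = {"кот": "cat", "сидит": "sits"}
    return " ".join(toy_dict.get(w, w) for w in text.split())

def find_candidates(text, index):
    """Step 3: retrieve documents sharing material with the translated text."""
    return [doc for doc in index if any(w in doc for w in text.split())]

def align_fragments(text, source):
    """Step 4: locate the overlapping material (here, just shared words)."""
    return [w for w in text.split() if w in source.split()]

def check_document(russian_text, index):
    english = machine_translate(russian_text)
    hits = []
    for src in find_candidates(english, index):
        hits += align_fragments(english, src)
    return hits  # step 5 would map these fragments back onto the Russian text

index = ["the cat sits on the mat", "unrelated document"]
hits = check_document("кот сидит", index)
```

The point of the sketch is the data flow: translation happens once, candidate retrieval is cheap and approximate, and only the few surviving candidates get the expensive fragment-level comparison.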

Step one. Machine translation and its ambiguity


The first task to solve once a document arrives for checking is translating its text into English. So as not to depend on third-party tools, we decided to take ready-made open-source algorithmic solutions and train them ourselves. For this we had to collect publicly available parallel corpora for the English-Russian language pair, and also to build such corpora ourselves by analyzing the pages of bilingual websites. The quality of the translator we trained is, of course, inferior to the leading solutions, but then nobody demands high translation quality from us. In the end we managed to collect about 20 million pairs of sentences from scientific texts, a sample sufficient for the task at hand.

Having implemented a machine translator, we faced the first difficulty - the translation is always ambiguous. One and the same meaning can be expressed in different words, the structure of a sentence and the order of words can change. And since translation is done automatically, machine translation errors are also superimposed here.

To illustrate this ambiguity, we took the first preprint from arxiv.org



and selected a small piece of text, which we asked two colleagues with a good command of English and two well-known machine-translation services to translate.



After analyzing the results, we were very surprised. Below you can see how different the translations turned out, although the general meaning of the fragment was preserved:



We assume that the text we automatically translate from Russian into English in the first step of our algorithm may itself have previously been translated from English into Russian. Naturally, we do not know how that initial translation was carried out. But even if we did, the chances of reproducing the source text exactly would be negligible.

Here one can draw a parallel with the mathematical "noisy channel" model. Suppose a text in English passed through a "noisy channel" and became a text in Russian, which in turn passed through another noisy channel (naturally, a different one) and came out as an English text differing from the original. The superposition of this double "noise" is one of the main difficulties of the task.



Step two. From exact matches to searching "by meaning"


It became obvious that even with the translated text in hand, the traditional shingles algorithm could not find borrowings over a source collection of many millions of documents with sufficient recall, precision, and speed.

So we decided to move away from the old word-matching search scheme. We clearly needed a different borrowing-detection algorithm: one that, on the one hand, could match text fragments "by meaning", and on the other, remained as fast as the shingles algorithm.

But what to do with the noise that double machine translation introduces into the texts? Will texts produced by different translators be detected, as in the example below?



We decided to enable search "by meaning" through clustering of English words, so that semantically related words and different forms of the same word fall into the same cluster. For example, the word "beer" falls into a cluster that also contains the following words:

[beer, beers, brewing, ale, brew, brewery, pint, stout, guinness, ipa, brewed, lager, ales, brews, pints, cask]

Now, before splitting a text into shingles, we replace each word with the label of the cluster it belongs to. And since the shingles are built with overlap, certain inaccuracies inherent in clustering algorithms can be ignored.

Despite clustering errors, candidate documents are retrieved with sufficient recall (a match on just a few shingles is enough for us), and still at high speed.
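A minimal sketch of this idea (the cluster map below is hand-made for illustration; the real clusters are built automatically over the English vocabulary):

```python
# Hypothetical word -> cluster-label map; in the real system clusters
# come from automatic clustering of semantically related English words.
CLUSTER = {
    "beer": "C7", "ale": "C7", "lager": "C7",
    "cold": "C3", "chilled": "C3",
    "tastes": "C9", "taste": "C9",
    "good": "C5", "great": "C5",
}

def cluster_shingles(text, size=3):
    """Replace words with cluster labels, then build overlapping shingles."""
    labels = [CLUSTER.get(w, w) for w in text.lower().split()]
    return {" ".join(labels[i:i + size]) for i in range(len(labels) - size + 1)}

# Two differently worded translations of the same sentence
# map to identical cluster-label shingles.
a = cluster_shingles("cold beer tastes good")
b = cluster_shingles("chilled ale taste great")
```

Because both sentences collapse to the same label sequence, they share every shingle, even though they share not a single word.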

Step three. Of all the candidates, the most worthy should win


So, candidate documents possibly containing translated borrowings have been found, and we can begin comparing the text of each candidate with the checked text "by meaning". Shingles will no longer help us here; that tool is too imprecise for this problem. Instead we will try the following idea: map each text fragment to a point in a space of very high dimension, striving to ensure that fragments close in meaning are represented by nearby points (close under some distance function).

We compute the coordinates of the point (a little more scientifically, the components of the vector) for a text fragment with a neural network, and we train this network on data labeled by assessors. The assessors' role is to create a training set, that is, to indicate for pairs of text fragments whether they are close in meaning or not. Naturally, the more labeled pairs we can collect, the better the trained network will perform.

The key task in all this work is choosing the right architecture and training the neural network. Our network must map a text fragment of arbitrary length into a vector of large but fixed dimension. Moreover, it must take into account the context of each word and the syntactic features of text fragments. For problems involving sequences of any kind (not only textual but also, for example, biological), there is a whole class of networks called recurrent. The main idea of such a network is to build the sequence vector by iteratively adding in information about each element of the sequence. In practice this basic model has many drawbacks: it is hard to train, and it quickly "forgets" information obtained from the first elements of the sequence. Therefore many more convenient network architectures correcting these shortcomings have been proposed on its basis. In our algorithm we use the GRU architecture, which controls how much information should be taken from the next element of the sequence and how much the network may "forget".
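To make the recurrent idea concrete, here is a toy GRU step in plain NumPy (random weights and standard gate equations; the production network, its dimensions, and its training are of course different):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step: gates decide how much past state to keep or overwrite."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)            # update gate
    r = sigmoid(Wr @ x + Ur @ h)            # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h)) # candidate state
    return (1 - z) * h + z * h_cand         # blend old state and candidate

def encode(sequence, hidden=8, seed=0):
    """Fold a sequence of word vectors into one fixed-size sentence vector."""
    rng = np.random.default_rng(seed)
    dim = sequence.shape[1]
    params = [rng.normal(scale=0.1, size=s)
              for s in [(hidden, dim), (hidden, hidden)] * 3]
    h = np.zeros(hidden)
    for x in sequence:
        h = gru_step(x, h, params)
    return h

vec = encode(np.ones((5, 4)))  # 5 "words" of 4-dim embeddings -> one 8-dim vector
```

However long the input sequence, the output vector has the same fixed dimension, which is exactly the property the comparison step needs.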

For the network to handle different kinds of translation well, we trained it on both manual and machine translation examples. The network was trained iteratively: after each iteration we examined which fragments it made the most mistakes on, and fed those fragments back to the network for further training.

Interestingly, using ready-made neural network models such as word2vec did not bring success. We used their results as a baseline below which we must not drop.

One more important point is worth noting: the size of the text fragment to be mapped to a point. Nothing prevents us, for example, from operating on full texts, representing each as a single object. But then only texts that coincide in meaning entirely would end up close; if only some part of a text is borrowed, the neural network will place the texts far apart and we will find nothing. A good, though not indisputable, option is to use individual sentences, and that is what we settled on.

Let's estimate how many sentence comparisons would be needed in a typical case. Suppose both the checked document and each candidate document contain 100 sentences, which corresponds to the size of an average scientific article. Comparing one candidate then takes 10,000 comparisons. With only 100 candidates (in practice, tens of thousands of candidates are sometimes retrieved from the multi-million index), we would need 1 million distance computations to find borrowings in just one document. And the stream of checked documents often exceeds 300 per minute. Moreover, computing each distance is itself not the cheapest operation.

To avoid comparing every sentence with every other, we preselect potentially close vectors using LSH hashing. The main idea of this algorithm is as follows: multiply each vector by a fixed matrix and record which components of the result are greater than zero and which are less. This record can be represented as a binary code for each vector, with an interesting property: close vectors have similar binary codes. With a proper choice of the algorithm's parameters, this reduces the number of required pairwise vector comparisons to a small number that can be handled in acceptable time.
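A sketch of this random-hyperplane flavor of LSH (the dimensions and hyperplane count below are arbitrary illustrative choices):

```python
import numpy as np

def lsh_codes(vectors, planes):
    """One bit per hyperplane: the sign of the projection.
    Close vectors agree on most bits of their binary code."""
    return (vectors @ planes.T > 0).astype(int)

rng = np.random.default_rng(42)
planes = rng.normal(size=(16, 64))      # 16 random hyperplanes in a 64-dim space

v = rng.normal(size=64)
near = v + 0.01 * rng.normal(size=64)   # slight perturbation of v
far = rng.normal(size=64)               # unrelated vector

code_v, code_near, code_far = lsh_codes(np.stack([v, near, far]), planes)
bits_near = int(np.sum(code_v != code_near))  # Hamming distance to the near vector
bits_far = int(np.sum(code_v != code_far))    # Hamming distance to the far vector
```

Only sentence pairs whose codes land in the same (or nearby) buckets are compared exactly, cutting the million pairwise distances down to a manageable number.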

Step four. "In order not to violate reporting..."


Here is how the results of the algorithm are displayed: now, when uploading a document, a user can select a check against the collection of translated borrowings. The result of the check is visible in the user's account:



Practice Check - Unexpected Results


So, the algorithm is ready, it was trained on model samples. Will we be able to find something interesting in practice?

We decided to search for translated borrowings in the largest electronic library of scientific articles, eLibrary.ru, whose core is formed by the scientific articles included in the Russian Science Citation Index (RSCI). In total, we checked about 2.5 million scientific articles in Russian.

As the search space, we indexed a collection of English-language articles archived at elibrary.ru, open-access journal sites, arxiv.org, and the English-language Wikipedia. The total source database in this live experiment amounted to 10 million texts. It may seem strange, but 10 million articles is a very small base: the number of scientific texts in English is estimated at billions at the least. Running the experiment with a base covering less than 1% of the potential sources of borrowing, we considered that even 100 detected cases would count as a success.

As a result, we found more than 20 thousand articles containing large amounts of translated borrowing. We invited experts to examine the cases in detail; they managed to check a little under 8 thousand articles. The results of the analysis of this part of the sample are presented in the table:

| Type of detection | Number |
| --- | --- |
| Incorrect borrowing | 2627 |
| of which: translated borrowing (text translated from English and passed off as original) | 921 |
| of which: "reverse" borrowing, from Russian into English (determined by publication date) | 1706 |
| Legitimate borrowing | 2355 |
| of which: bilingual articles (works of the same author in two languages) | 788 |
| of which: quotations of laws (use of the wording of laws) | 1567 |
| Self-citation (the author's translated citation of his own English-language work) | 660 |
| Erroneous detections (due to incorrect translation or neural network errors) | 507 |
| Other (checked articles contained fragments in English, or were hard to assign to any category) | 1540 |
| Total | 7689 |

Some of the detections correspond to legitimate borrowing: translated works by the same authors or co-authors, as well as correct matches of identical phrases, as a rule quotations of the same laws translated into Russian. But a significant share of the detections are illegitimate translated borrowings.

Based on the analysis, several interesting conclusions can be made, for example, on the distribution of the percentage of borrowings:



It can be seen that small fragments are borrowed most often; there are, however, works borrowed in their entirety, including graphs and tables.



From the histogram below it is clear that people prefer to borrow from recently published articles, although there are works whose source dates back to, for example, 1957.

We used the metadata provided by eLibrary.ru, including each article's field of knowledge. This information makes it possible to determine in which Russian scientific fields borrowing by translation from English is most frequent.



The most obvious way to verify the correctness of the results is to compare the texts of the two works, the checked one and the source, side by side.



Above is the English-language work from arxiv.org; below is the Russian-language work, which, in its entirety, including graphs and results, is a translation. The corresponding blocks are marked in red. Notably, the authors went even further: they translated the remaining pieces of the original article as well and published a couple more articles "of their own", choosing not to cite the original. Information on all detected cases of translated borrowing was submitted to the editorial boards of the scientific journals that published the corresponding articles.

Thus, the result could not but please us: the Anti-Plagiarism system has gained a new module for detecting translated borrowings, which now checks Russian-language documents against English-language sources.

Create with your own mind!
