"Three in a boat, poverty and dogs", or how Antiplagiat seeks paraphrase

The new school year has arrived. Students have received their class schedules and started thinking about ~~parties, girls and guitars~~ the upcoming exam session. Term papers, theses, articles and dissertations are not far off. That means text analysis for borrowings, verification reports and other headaches for students and administrators are coming too. And hundreds of thousands of people (no joke - we counted!) are already asking the logical question: how to fool Antiplagiat. In our case, almost all cheating methods involve distorting the text in one way or another. We have already taught Antiplagiat to detect text “distorted” by translation from English into Russian (we wrote about this in the first article of our corporate blog). Today we will discuss how to detect the most effective, albeit time-consuming, method of text distortion - paraphrase.

From Russian to Russian, or by the way

From the point of view of a ~~normal~~ ordinary person, a paraphrase is a rewriting of a text in other words (most often one's own). When paraphrasing, people try to preserve the meaning of the original text as much as possible while stripping the text of any formal similarity to the original. In general, all paraphrases follow certain rules that people usually apply without even being aware of it (see, for example, the article by Alberto Barrón-Cedeño).

Using the well-known story “Mumu” [like the title of this article, it also features a dog, people and a boat :-)], let us take a closer look at what can be done with a text so that its meaning is preserved while the sentences look different.

1. The first thing that comes to mind is to replace most of the words with synonyms. This is the easiest thing to do with a text: the meaning does not change, yet at first glance the text looks different. This is the trick that synonymizer programs use. They replace words without taking the context into account, simply picking a word from a list of synonyms, so a sentence processed by such a program very often looks rather absurd. This method also includes periphrasis - a descriptive designation of an object based on some of its qualities, characteristics or features, for example “blue planet” instead of “Earth”, “one-armed bandit” instead of “slot machine”, etc.

Original	Paraphrase
The lady began to call her to her in a tender voice.	The noblewoman started calling her to her with a courteous voice.

2. Replacing some parts of speech with others also allows you to change the structure of the sentence. For example, the verb is often replaced by a noun and vice versa.

Original	Paraphrase
One fine summer day, the lady was pacing the living room with her hangers-on.	The mistress's walk with her hangers-on took place on a beautiful summer day.

3. Another simple way to change the structure of the text is to simply divide the sentences into simpler ones, or conversely, combine them into long ones.

Original	Paraphrase
Gerasim was a little amazed, but he called Mumu, picked her up from the ground and handed her to Stepan.	Gerasim was a little surprised, but then he called Mumu. He picked her up from the ground and handed her over to Stepan.

4. The original sentence can also be changed substantially, and rather inventively, with the help of the passive voice.

Original	Paraphrase
The mistress ordered her senior hanger-on to be called.	The senior hanger-on was summoned on the mistress's orders.

These are just the typical tricks. Obviously, a good paraphrase is very difficult to detect. Sometimes only specialists with deep knowledge of the text's subject area can do it. But the problem we are solving does not require this. After all, deep paraphrasing takes considerable effort and therefore a lot of time. Most likely, it will be easier for students to write their own work than to spend time seriously paraphrasing someone else's text, which, despite all the effort, may still be detected during verification.

Therefore, our target is relatively simple paraphrase, the kind that can be done “with the spinal cord”, i.e. without spending thought or time.

In fact, paraphrasing is the “sister” of translation into another language. The words change, but the meaning remains. One could say that paraphrasing a Russian-language text is in fact a translation from Russian into Russian.

That is why the paraphrase detection algorithm turned out to be a “close relative” of the translated borrowing detection algorithm. So here is how borrowings are detected in this case:

1. A Russian-language document to be checked arrives at the input.

2. ~~The Russian text is machine-translated into English.~~

3. Candidate sources of borrowing are searched for in the indexed collection of ~~English-language~~ Russian-language documents.

4. Each candidate found is compared with ~~the English version of~~ the document being checked, and the boundaries of the borrowed fragments are determined.

5. The fragment boundaries are transferred to the Russian version of the document being checked. At the end of the process, a verification report is generated.
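
The steps above can be sketched in a few lines of Python. This is only a toy illustration, not the production algorithm: the in-memory `collection`, the Jaccard word overlap used as the similarity, and the `threshold` and `top_k` parameters are all assumptions made for the example. Since no translation is involved, the fragment boundaries are already offsets in the checked document.

```python
import re


def words(text):
    """Lowercased set of word tokens (a stand-in for real text features)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Word-set overlap; a placeholder for the real similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sentences_with_offsets(text):
    """Split text into sentences, keeping (start, end, sentence) offsets."""
    result = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        stripped = raw.strip()
        if stripped:
            start = match.start() + (len(raw) - len(raw.lstrip()))
            result.append((start, start + len(stripped), stripped))
    return result


def find_candidates(doc, collection, top_k=1):
    """Step 3: rank documents of the (hypothetical) collection by overlap."""
    doc_words = words(doc)
    ranked = sorted(
        collection,
        key=lambda name: jaccard(doc_words, words(collection[name])),
        reverse=True,
    )
    return ranked[:top_k]


def build_report(doc, collection, threshold=0.5, top_k=1):
    """Steps 4-5: compare sentences and report borrowed fragment boundaries."""
    report = []
    for name in find_candidates(doc, collection, top_k):
        source = [words(s) for _, _, s in sentences_with_offsets(collection[name])]
        for start, end, sentence in sentences_with_offsets(doc):
            sentence_words = words(sentence)
            if any(jaccard(sentence_words, src) >= threshold for src in source):
                report.append((name, start, end))
    return report
```

For example, with a two-document collection in which one document contains the sentence “Gerasim headed for the gate.”, `build_report` returns the name of that document together with the character offsets of the matching sentence in the checked text.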

An important difference is that the parameters of the algorithm are tuned on different data and take into account the specifics of the Russian language. At the same time, we keep the same tuning strategy: a focus on precision at the expense of recall. Our task is to minimize the number of false positives, even at the cost of “conceding a few goals”.
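
The precision-first strategy can be illustrated with a small sketch. It assumes a hypothetical labeled set of sentence pairs, each with a similarity score and a true/false “borrowed” label, and picks the lowest threshold that still meets a required precision. The function names and the data are invented for the example and are not taken from the system.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when pairs with score >= threshold are flagged."""
    flagged = [score >= threshold for score in scores]
    tp = sum(f and l for f, l in zip(flagged, labels))
    fp = sum(f and not l for f, l in zip(flagged, labels))
    fn = sum((not f) and l for f, l in zip(flagged, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(scores, labels, min_precision):
    """Lowest threshold reaching min_precision, i.e. the best recall
    available under the precision constraint. Returns None if none fits."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall(scores, labels, threshold)
        if precision >= min_precision:
            return threshold, precision, recall
    return None


# Hypothetical validation data: similarity scores and ground-truth labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
labels = [True, True, False, True, False, False]
print(pick_threshold(scores, labels, min_precision=0.95))
```

On this toy data the chosen threshold flags only the two highest-scoring pairs: there are no false positives, but one genuinely borrowed pair is missed, which is exactly the trade-off described above.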

“Haute couture” tuning

Paraphrase is certainly a time-consuming way to distort a text. Moreover, not all rewriting methods are equally good at making the text unrecognizable. Trying to save time, the author uses the simplest ways of modifying the text, which the system's algorithms detect, so they bring no result. Therefore, after the first unsuccessful attempt to inflate the originality score, the author begins to “tune” the text. It works like this: various combinations of methods are applied, and after each combination the modified text is uploaded to the system to check how successful the paraphrasing was and whether the user managed to get the desired percentage of originality. The result is a chain of texts, each of which has been paraphrased to a different degree. Extracting such a chain is a fairly simple engineering task. Our study of such “chains” revealed the most frequent modification methods (confirming, along the way, the results of the same Alberto Barrón-Cedeño) and provided rich material for training new algorithms.

Let's do a little experiment. Take a small excerpt from the already mentioned story by Turgenev:

An hour after all this commotion, the door of the garret opened and Gerasim appeared. He was wearing his holiday caftan; he led Mumu on a string. Eroshka stepped aside and let him pass. Gerasim headed for the gate. The boys and everyone else in the yard followed him with their eyes in silence. He did not even turn around; he put on his hat only once he was in the street. Gavrila sent the same Eroshka after him as an observer. Eroshka saw from a distance that he had entered the tavern with the dog, and waited for him to come out.

We will try to fool Antiplagiat. To begin with, let's try an automatic text synonymizer. Such programs are not known for their ~~intelligence~~ quality - they just take words and replace them with synonyms from a dictionary, without taking the context into account. Therefore, texts processed by such a program often look quite clumsy. Here is what came out of one of these programs:

In time after this disturbance, the doors of the konraks dissolved, and Gerasim introduced himself. He was wearing a solemn caftan; someone led Mumu in a string. Eroshka stepped aside and let him make his way. Gerasim rushed to the gate. The boys and all, without exception, in the yard accompanied him with his eyes, without saying a word. He didn’t turn around either: he only wore a headdress in the street. Gavrila sent after him the same Eroshka in the property of the observer. Eroshka saw from afar, the fact that someone entered the tavern together with the dog, and began to wait for his release

Note that at least one word is replaced in each sentence. Such a seemingly small change is enough for the “ordinary Antiplagiat” to stop associating the rewritten sentences with the original.
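
The context-free replacement that such programs perform can be sketched as follows. This is a guess at the general idea rather than the code of any real synonymizer, and the tiny `SYNONYMS` dictionary is invented for the example.

```python
import random
import re

# Hypothetical synonym dictionary; a real tool would use a large thesaurus.
SYNONYMS = {
    "headed": ["rushed", "proceeded"],
    "appeared": ["introduced himself", "emerged"],
    "holiday": ["solemn", "festive"],
}


def synonymize(text, synonyms, rng):
    """Replace every known word with a randomly chosen synonym,
    ignoring context entirely (which is why the output can be absurd)."""

    def swap(match):
        word = match.group()
        options = synonyms.get(word.lower())
        if not options:
            return word
        replacement = rng.choice(options)
        # Keep the capitalization of the original word.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"\w+", swap, text)


print(synonymize("Gerasim headed for the gate.", SYNONYMS, random.Random()))
```

Because the choice ignores the sense of the word in the sentence, a replacement such as “introduced himself” for “appeared” is just as likely as a fitting one.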

Now let's compare pairs of sentences from the source text and the rewritten text using our algorithm. For this we will use the cosine similarity measure. As in the translated borrowing detection algorithm, each sentence is represented as a high-dimensional vector. By measuring the cosine of the angle between a pair of such vectors, we can judge how much the vectors “resemble” each other and, accordingly, how similar the corresponding sentences are.
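
As a reminder, the cosine measure itself is simple to compute. The sketch below assumes the sentence vectors are already available as plain lists of numbers; how the system actually builds these vectors is not shown here, and the two function names are chosen for the example.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_u * norm_v)


def similarity_matrix(original_vectors, rewritten_vectors):
    """Pairwise cosine values: rows are original sentences,
    columns are rewritten sentences."""
    return [
        [cosine_similarity(u, v) for v in rewritten_vectors]
        for u in original_vectors
    ]
```

A value close to 1 means the two vectors point in almost the same direction, a value close to 0 means they have almost nothing in common. A matrix of such values for all sentence pairs is the kind of data that can be drawn as a heat map.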

Here is what happened after comparing the sentences with our algorithm:

For clarity, we show the cosine value as a heat scale: the “hotter” the color between a pair of sentences, the greater the cosine value and the more similar the two sentences are considered to be. Note that the lowest cosine values went to sentences in which the synonym substitutions fit the context very poorly. For example, “so” and “thus and” are indeed very often synonymous, but in this context such a replacement is completely out of place.

Now let's play synonymizer ourselves and rewrite the text while preserving its meaning. Unlike the program, we make sure that all our changes are grammatically consistent and fit well into the context. Here's what we got:

In this case, too, the algorithm gives a sufficiently high similarity score for most of the sentences. The sentences that received low scores underwent a rather deep transformation: their grammatical structure was changed considerably. Even a person skimming them quickly would not immediately say whether these sentences are similar.

And now what to do with all this?

Naturally, the best way to understand whether a new algorithm works is to examine its quality on real data. Therefore, we installed the new paraphrase detection module in production and ran real requests through it (without showing the results to users yet). Each work was checked both by the current borrowing search algorithm - “literal comparison” - and by the new algorithm - “paraphrase detection”. Then we compared about 10 thousand verification reports produced by the two algorithms for the uploaded works. The results were interesting.

This graph shows the distribution of borrowing percentages for both algorithms. It can be seen that “paraphrase detection” finds on average 10 percent more borrowing than “literal comparison”.

On the second graph, the horizontal axis shows the absolute (not relative) difference between the borrowing percentage found by the proposed algorithm and the one found by the current algorithm. A difference greater than 0 means that “paraphrase detection” found more than “literal comparison”.
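
The comparison behind the second graph boils down to a per-document difference. Here is a minimal sketch with invented numbers; the real reports and their format are not shown in this article, so the function name and the input lists are assumptions.

```python
from statistics import mean


def compare_reports(literal_percent, paraphrase_percent):
    """Per-document difference 'paraphrase detection' minus
    'literal comparison', in percentage points, plus simple summaries."""
    if len(literal_percent) != len(paraphrase_percent):
        raise ValueError("both lists must describe the same documents")
    diffs = [p - l for l, p in zip(literal_percent, paraphrase_percent)]
    return {
        "diffs": diffs,
        "mean_diff": mean(diffs),
        "share_with_more_found": sum(d > 0 for d in diffs) / len(diffs),
    }


# Hypothetical borrowing percentages for four documents.
literal = [10.0, 25.0, 40.0, 5.0]
paraphrase = [20.0, 30.0, 40.0, 15.0]
print(compare_reports(literal, paraphrase))
```

A histogram of the `diffs` values is what a graph like the second one would display.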

Findings

Paraphrase really is used as a way to distort text when writing academic works;
The number of “positives” has not increased dramatically: the algorithm finds text that really has been paraphrased;
As in the case of translated borrowings, the Antiplagiat system has received a new module - the paraphrase detection system;
And of course, our classic advice - it is better to create with your own mind!

The architecture of the paraphrase detection algorithm and the first results of its work were presented at the Big Scholar workshop on the analysis of scientific data, which this year was held at one of the main machine learning conferences - KDD 2018.

The paraphrase detection module is deployed in production and is already used by teachers and students when checking texts for borrowings.

The article was prepared in collaboration with Rita_Kuznetsova, Oleg_Bakhteev, Kamil Safin and chernasty. The original image used for the lead illustration was taken from demotivators.cc.
