So hard to find, so easy to lose, and impossible to select

    Our rules of life: start article titles with the letter "T" and hunt for text borrowings quickly, accurately and, most importantly, beautifully. For more than a year we have been successfully finding translated borrowings and rewrites with the help of neural networks. But sometimes you need to deliberately "shoot yourself in the foot" and, limping, take a different path: that is, not check a piece of text for paraphrase or plagiarism at all, but simply leave it alone. Paradoxical and painful, but necessary. Let's say it right away: we will not touch the bibliography. But how do we find it in the text? Why is that easy to say, yet much harder to do than it seems? All this in today's installment of the Antiplagiarism corporate blog, the only blog where strikethrough text is frowned upon.


    Why does finding it take so long?

    First, a little theory. What is a document, and how should we deal with it? In "The Archaeology of Knowledge", M. Foucault notes: "History now organizes the document, splits it up, arranges it, redistributes the levels, establishes the ranks, qualifies them according to their degree of significance, isolates the elements, defines the units, describes the relations." We, of course, are not historians of ideas, but we know from experience that a document is a patchwork quilt of various elements sewn together. Which elements these are, and how they are interconnected, depends on the specific document. If it is, for example, a student paper, it will most likely include a title page, chapters of the main text, figures, tables, formulas, a list of references and appendices. A scientific article will most likely have an abstract, but the title page may be missing entirely. And a collection of articles or conference proceedings includes a whole host of articles, each with its own structure. In a word, every document is different.

    Ideally, everyone - both we and the teachers - would like to have the full document structure and to process each element in the way a specific task requires. The first step to success is to determine what each element is called. Stack_more_layers and I decided to start with the last but not the least, namely with the text element called "bibliography". This is the segment whose text borrowings interest the user least of all. Therefore, the report must show that we "caught" the bibliography and did not go looking for anything in it.

    Life is like a play: it matters not how long it lasts, but whether there is a bibliography at the end.

    In a perfect world everything is beautiful, including the appearance of documents. The text of an ideal document is well structured, pleasant to read, and finding the bibliography by quickly dragging the scrollbar to the very end is no trouble at all. As practice has shown, reality is structured quite differently.

    To begin with, by "bibliography" people mean, interchangeably, any of the following: "list of references", "literature used", "list of sources" and more than a hundred (sic!) other titles. In principle, for such things there are rules for formatting bibliographic references and records, by which a list of references could be pulled out of the text layer. More than that: there is even a GOST standard for formatting these records. Here, for example, is a correctly formatted bibliographic record for a well-known book:

    True, it is worth bearing in mind that the manual on formatting records "according to GOST" runs to almost 150 pages. For bibliographic references in non-print publications there is a separate GOST of more than 20 pages. A reasonable question arises: how many people will devote time to such entertaining reading just to format a few references correctly? As practice shows, very few. Of course, there are automatic typesetting systems (for example, LaTeX), but among students (and they are the majority of our "clients") they are not very common. As a result, the input is a text that contains (or maybe does not contain) an at best loosely structured list of sources.

    Let us clarify one more point. We do not work directly with the uploaded files (pdf, docx, doc, etc.); we first bring them to a unified form, namely, we extract the text layer. This means that all formatting, such as font type or size, is stripped from the text. As a result, we have only "raw" text at our disposal, which often looks quite bad due to various extraction artifacts.

    Note right away that our algorithm must be fast and highly accurate. Selecting the bibliography block is only an auxiliary "feature" in the overall document-checking pipeline, so it must not consume many resources. This means the algorithm should not be overcomplicated.

    To do this, we first define the quality metric by which we will evaluate the algorithm. We treat the task as a classification problem: each line of text is assigned to one of two classes, bibliography or non-bibliography. In order not to complicate life with poorly interpretable quality indicators (and there are enough of those!), we will look at the proportions of correctly and incorrectly classified lines. We assume that the incoming text layer is already split into lines and, more importantly, for the classification to make sense at all, that no single line mixes bibliography with extraneous text. This is a fairly strong assumption, but almost all texts that pass through our DocParser satisfy it. For a two-class classification problem such as ours, the standard metrics are Precision and Recall. How they work is shown in the picture below:

    Image source above: Wikipedia
    Image source below: Series: For Dummies

    The picture shows how often the algorithm classified a line correctly (or not), namely:

    • TP is a bibliographic line that the algorithm classified correctly;
    • TN is a plain-text line that the algorithm classified correctly;
    • FP is a plain-text line that the algorithm classified as bibliographic;
    • FN is a bibliographic line that the algorithm classified as plain text.
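    The line-level evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not our production code; the labels here are simply 1 for "bibliography" and 0 for "plain text".

```python
# Compute Precision and Recall for a per-line bibliography classifier.
# y_true, y_pred: lists of 0/1 labels, one per line of the text layer.

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 4 lines, one bibliographic line missed (FN)
print(precision_recall([1, 1, 0, 0], [1, 0, 0, 0]))  # (1.0, 0.5)
```

    Note how a single false negative leaves Precision perfect but halves Recall; this asymmetry is exactly what the requirement below is about.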

    Another requirement is that our algorithm must be sufficiently precise (i.e., have a sufficiently high Precision score). This can be read as: "better to miss something we need than to select something we don't."

    All My Dreams Come True

    What do you think takes the most time when solving a research problem? Developing the algorithm? Embedding the solution in an existing system, or testing? Not at all!

    Oddly enough, most of the time is spent collecting and preparing data. So it was in our case: to devise an algorithm and tune its parameters, you need a sufficient number of labeled documents at your disposal, that is, documents for which it is known exactly where the bibliographic records are. We could have hired third-party assessors, but for small tasks like this you can usually get by with little bloodshed and label the data yourself. In the end, through joint efforts, we processed about 1000 documents. For training a neural network, say, that would not be enough. But recall that the algorithm should be simple, which means that not much data is needed to tune its parameters.

    However, before developing an algorithm, you need to understand the specifics of the data. Having looked through about 1000 random documents, or rather their text layers, we can draw some conclusions about how bibliographic text differs from ordinary text. One of the most important patterns: a bibliography almost always begins with a keyword. Besides the popular "list of references" or "sources used", there are also quite exotic ones, for example, "Textbooks, manuals, monographs".

    Another equally important feature is the numbering of bibliographic records. That said, all these "signs" of a reference list are very unreliable, and it is far from always possible to find every bibliographic record in the text by them.

    However, even such imprecise features are enough to build the simplest algorithm for finding a bibliography in the text layer. Let us describe it more formally:

    1. We look for the "treble clefs" in the text, i.e. the bibliography keywords;
    2. We try to find, further down the text, the numbering of bibliographic records;
    3. If numbering is present, we follow it through the text until it ends.
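    The three steps above can be sketched as follows. This is a hedged illustration: the keyword list and the numbering regex are invented for the example and are certainly far simpler than the production ones.

```python
import re

# Illustrative keyword list and numbering pattern (not the production set).
KEYWORDS = ("список литературы", "list of references", "bibliography",
            "used sources")
NUMBERED = re.compile(r"^\s*\[?\d{1,3}[.\])]\s+")  # e.g. "12.", "[3]", "7)"

def find_numbered_bibliography(lines):
    """Return indices of lines in a numbered block that follows a keyword."""
    selected = []
    start = None
    for i, line in enumerate(lines):          # step 1: find the keyword
        if any(k in line.lower() for k in KEYWORDS):
            start = i
            break
    if start is None:
        return selected
    for j in range(start + 1, len(lines)):    # steps 2-3: follow the numbering
        if NUMBERED.match(lines[j]):
            selected.append(j)
        elif selected:                        # numbering ended, stop
            break
    return selected

doc = ["...main text...",
       "List of references",
       "1. Ivanov I.I. Some book. 1999.",
       "2. Petrov P.P. Some article. 2005.",
       "Appendix A"]
print(find_numbered_bibliography(doc))  # [2, 3]
```

    The early stop at the first non-numbered line after the block is what keeps the heuristic precise: it never wanders into appendices or trailing text.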

    This simple algorithm shows nearly 100% precision but very low recall. In other words, it selects only bibliographic lines, but so selectively that it finds only a small part of the bibliography. The trouble is that a bibliography may well not be numbered at all, so we will use this algorithm only as an auxiliary one.

    Let us now try to build another algorithm that finds the remaining kinds of bibliographic records in the text. To do this, we highlight the characteristics that distinguish plain-text lines from lines of bibliographic records. Note that bibliographic text is fairly structured, even though each author shapes that structure in their own way. We identified the following distinctive features of the lines we are looking for:

    1. Numbering at the beginning of the line - already mentioned above when describing the first algorithm;
    2. Numbers that look like years. These should be not just any four-digit numbers (otherwise there will be many false matches), but the specific years most often cited: from the 1900s to the present;
    3. Lists of names of authors, editors and other people involved in the publication, in various formats;
    4. Page numbers, volume numbers and other information of that kind;
    5. Phrases indicating the issue number;
    6. A URL in the line;
    7. Professional vocabulary in the line. For the most part, these are special abbreviations, such as 'conf.', 'sci.-pract.' and the like.
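    A rough sketch of how such binary features could be extracted per line is shown below. The regexes and the abbreviation vocabulary are our own illustrative guesses, not the production patterns.

```python
import re

# Illustrative abbreviation vocabulary (feature 7); the real list is larger.
ABBREVS = ("conf.", "proc.", "vol.", "ed.", "изд.", "науч.")

def line_features(line):
    """Return the 7 binary features of one text-layer line as booleans."""
    low = line.lower()
    return [
        bool(re.match(r"^\s*\[?\d{1,3}[.\])]\s", line)),          # 1. numbering
        bool(re.search(r"\b(19\d{2}|20\d{2})\b", line)),          # 2. year 1900+
        bool(re.search(                                            # 3. "Ivanov I.I."
            r"\b[A-ZА-Я][a-zа-я]+ [A-ZА-Я]\.\s?[A-ZА-Я]\.", line)),
        bool(re.search(r"\b(pp?\.|с\.)\s?\d+", low)),             # 4. pages
        bool(re.search(r"(№|\bno\.|\bissue)\s?\d+", low)),        # 5. issue number
        bool(re.search(r"https?://|www\.", low)),                 # 6. URL
        any(a in low for a in ABBREVS),                            # 7. abbreviations
    ]

print(line_features("5. Ivanov I.I. Neural nets // Proc. conf. 2019. P. 12."))
# [True, True, True, True, False, False, True]
```

    Each line thus becomes a short binary vector, which is exactly the input format the classifier below expects.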

    We encode these features as binary ones and train one of the simplest, yet quite effective classifiers on them - Random Forest. Random Forest is an ensemble classification method. It consists of many (usually about 100) simple decision trees, each of which makes its own decision about which class the object in question belongs to. The answer of the whole ensemble is formed very simply: the class chosen by the majority of the decision trees wins:
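    The majority-vote step itself is trivial to illustrate. In the toy below, each "tree" has already produced its class prediction for one line; in practice the trees would come from a real library such as scikit-learn's RandomForestClassifier, which we name here only as an example.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the class predicted by the majority of the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# 100 toy trees: 63 vote "bibliography" (1), 37 vote "plain text" (0)
votes = [1] * 63 + [0] * 37
print(forest_vote(votes))  # 1
```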


    As mentioned above, we tune the parameters of the algorithm to maximize its precision. Let's apply the algorithm to a document and look at the result:

    In the picture above, the lines the algorithm considers bibliographic are highlighted in red. As you can see, the algorithm copes with its task quite well - there are almost no spurious highlights in the main text; however, the bibliography itself is detected in "pieces". This is easy to explain: since the algorithm is tuned for high precision, it selects only the lines that are bibliographic with high probability. The unselected lines, from the algorithm's point of view, look like fragments of plain text.

    Let's try to "comb" the result. We need to eliminate two problems: random isolated selections inside the main text and gaps in the selection of the bibliography itself. A quick and effective remedy is a pair of "gluing" and "thinning" operations. The names speak for themselves: we remove isolated bibliographic lines and glue together selected lines separated by only a few unselected ones. Most likely, several gluing-thinning iterations will be needed, because after a single pass isolated non-bibliographic lines may get glued in rather than deleted. We tuned the parameters of the gluing and thinning operations (number of passes, gluing width, deletion thresholds) on a separate subsample (for those unfamiliar with "overfitting", we recommend reading up on it).
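    The gluing and thinning operations can be sketched as simple run-length passes over the per-line 0/1 mask. The gap width and minimum run length below are illustrative parameters, not the tuned production values.

```python
def glue(labels, max_gap=2):
    """Fill interior gaps of up to max_gap unselected lines between selections."""
    out = labels[:]
    i = 0
    while i < len(out):
        if out[i] == 0:
            j = i
            while j < len(out) and out[j] == 0:
                j += 1
            # only interior gaps (selected lines on both sides) get glued
            if 0 < i and j < len(out) and j - i <= max_gap:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out

def thin(labels, min_run=2):
    """Drop isolated runs of selected lines shorter than min_run."""
    out = labels[:]
    i = 0
    while i < len(out):
        if out[i] == 1:
            j = i
            while j < len(out) and out[j] == 1:
                j += 1
            if j - i < min_run:
                for k in range(i, j):
                    out[k] = 0
            i = j
        else:
            i += 1
    return out

raw = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
print(thin(glue(raw)))
# [0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

    Note the order matters: gluing first fills the small gaps inside the bibliography, and thinning then removes the lone spurious selection near the end, which is exactly why several alternating passes may be needed in practice.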

    What happened after our improvements? Looking through several documents, we noticed bibliographies with the following "features":

    Fortunately, we already have a simple but effective algorithm that handles exactly such cases. And since that simple algorithm selects nothing beyond the required bibliography lines, we can combine the results of the two algorithms without any loss of quality.
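    The combination step can be sketched as a simple element-wise union of the two algorithms' per-line masks (1 = bibliographic line); the function name is ours, for illustration only.

```python
def combine_masks(mask_a, mask_b):
    """Union of two per-line selections: a line is kept if either algorithm kept it."""
    return [int(a or b) for a, b in zip(mask_a, mask_b)]

print(combine_masks([1, 1, 0, 0], [0, 1, 1, 0]))  # [1, 1, 1, 0]
```

    Because both algorithms are tuned for high precision, the union adds coverage without introducing new false positives in practice.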

    It looks pretty good. Of course, since the algorithm is probabilistic, it is possible that a bibliography will not be found in the text at all. We deliberately mangled a reference list so that the algorithm "did not notice" it:

    But such a bibliography, in our purely subjective view, no longer differs much from ordinary text.

    So what did we end up with? We implemented a module for detecting bibliographic records in uploaded documents. The module consists of two algorithms, each tailored to its own specifics. One algorithm selects the numbered bibliography block that follows a keyword. The second selects the lines that are bibliographic with high probability and then performs several "gluing" and "thinning" operations. The module's output is the union of the results of these two algorithms.

    It is also worth noting that the algorithm is quite fast even on large documents, which means it fully meets the "auxiliary feature" requirements of the checking pipeline.


    And that's how, out of matchsticks and acorns...

    As a result, we built a simple but effective process for extracting bibliographic records from user texts. And although this is only a small part of the larger task of detecting document structure, it is nevertheless a big step towards improving the quality of the Anti-Plagiarism service. By the way, the results of our work can already be seen in the system's user reports. Create with your own mind!
