Plagiarism search system

    Foreword


    At one time I had a knack for landing all sorts of strange jobs. For example, I almost became an administrator at a synagogue. The only thing that stopped me was the hunch that, as the resident goy, I would be the one made to work there on Saturdays.

    Another offer was also a curious one. The company wrote essays and term papers for American students who were supposed to write them themselves. Later I learned that this is a fairly widespread and profitable business, which has even earned its own name, the "paper mill", but from the start this way of making a living struck me as a bit of a scam. Still, it must be said that the job had plenty of interesting problems, among them the most difficult and ingenious one of my career, the one I can proudly tell my children about someday.

    The problem statement was very simple. The writers were remote workers, most of them not native speakers of English, and no less lazy than the students themselves. They often took the path of least resistance: instead of writing an original paper, they simply lifted one from the Internet, in whole or in part. So the task was to find the source (or sources), compare them, somehow determine the percentage of overlap, and hand over the collected evidence to incriminate the culprits.

    The job was made somewhat easier by the language of the papers: it was exclusively English, with no cases or complex inflectional forms. It was made much harder by the fact that it was entirely unclear from which end to approach it.

    Perl was chosen as the implementation language, which turned out to be a very good call. Solving this problem in a static compiled language, with their rigidity and slow start-up, would have been impossible: you can rewrite a finished solution in one, but you cannot arrive at that solution through endless experiments. Plus there is Perl's heap of excellent, battle-tested libraries.

    Dead ends


    Initially, taking a crack at the task was entrusted to some shaggy-haired student. He did not overthink it. If you need to search the Internet, you need a search engine: shove the entire text into Google, and it will find where it came from. Then we fetch the sources it finds and compare them with pieces of the original text.

    Of course, nothing came of this.

    Firstly, if you send Google the whole text, it searches very poorly. After all, what it stores are indexes, in which the number of adjacent words is inevitably limited.

    Secondly, it quickly became clear that Google really does not like a stream of searches coming from the same address. Until then I had thought the question "Have you been banned from Google?" was just a joke. It turned out to be nothing of the sort: after a certain number of requests Google really does ban you, putting up a rather elaborate captcha.

    And the very idea of parsing the HTML is shaky to begin with: the program can break at any moment, as soon as Google decides to tweak the layout of its results page.

    The student decided to cover his tracks and reach the search engine through open proxies: find a list of them on the Internet and rotate through it. In practice, half of those proxies did not work at all and the other half were shamelessly slow, so nothing good came of that either.

    And thirdly, finding pieces of text by character-by-character comparison turned out to be unbearably slow and completely impractical. It is also useless: the writers had enough cunning not to copy texts verbatim but to tweak the wording here and there.

    I had to start by reading the specialized literature. Since the task turned out to be a niche one, it was not covered in any textbook or any solid book. All I found was a pile of scientific articles on narrow subproblems and one survey dissertation by some Czech. Alas, it reached me too late: by then I already knew every method described in it.

    Digressing for a moment, I cannot help noting that almost all scientific articles published in reputable journals are a) hard to get hold of and b) rather useless. The sites where they are kept, the ones a search engine offers first, are invariably paywalled and charge quite a bite, usually almost ten dollars per paper. However, if you dig a little deeper you can usually find the same article freely available. Failing that, you can write to the author, who as a rule will kindly send a copy (from which I conclude that the authors themselves get little out of the current system, and the proceeds go to someone else).

    However, there is usually little practical benefit in any particular article. With rare exceptions they contain nothing you could sit down with and immediately sketch out an algorithm. They offer either abstract ideas with no hint of how to implement them; or a heap of mathematical formulas, after fighting through which you realize the same thing could have been said in two lines of plain language; or the results of the authors' experiments, with the invariable comment: "not everything is clear yet, work must continue." I do not know whether these articles are written for show, or for some internal scientific ritual, or whether the authors simply begrudge sharing real ideas that could be put to good use in a startup of their own. In any case, the erosion of science is plain to see.

    Incidentally, the largest and best-known plagiarism detection site is Turnitin, a de facto monopolist in the field. Its inner workings are classified no less tightly than a military base: I did not find a single article, not even a short note, describing even in the most general terms which algorithms it uses. A complete mystery.

    But enough digressions; back to dead ends, this time my own.

    The idea of document fingerprinting came to nothing. On paper it looked quite good: for every document downloaded from the Internet you compute a fingerprint, some long number that somehow reflects its contents. The plan was to build a database storing URLs and fingerprints instead of the documents themselves, so that comparing the source text against the fingerprint database would immediately turn up the suspects. It does not work: the shorter the fingerprints, the worse the comparison, and by the time they reach half the length of the source there is no point in storing them at all. Add to that the edits the writers make to fool detection, and the sheer size of the Internet: even the shortest fingerprints quickly become a burden to store because of the gigantic volume of data.

    Parse and normalize


    At first glance this stage seems banal and uninteresting: obviously the input will be text in MS Word format rather than a plain text file, and it has to be taken apart and broken into sentences and words. In fact, this is a huge source of quality improvement, worth more than any clever algorithm further down the line. It is like OCR for books: if the original was scanned crookedly and smeared with ink, no later tricks will fix it.

    Incidentally, parsing and normalization are needed not only for the source text but also for every document found on the Internet, so besides quality, speed matters here too.

    So, we receive a document in one of the common formats. Most of them are easy to parse: HTML, for example, reads beautifully with HTML::Parser, and all sorts of PDF and PS files can be handled by calling an external program such as pstotext. Parsing OpenOffice documents is a sheer pleasure; you can even bolt on XSLT if you enjoy that kind of perversion. Only the wretched Word spoils the picture; a more miserable text format is hard to imagine: hellishly complicated to parse and devoid of any internal structure. For the details I refer you to my previous article. If it were up to me I would never have touched it, but it is more widespread than all the other formats combined. Either this is Gresham's law in action, or the machinations of world evil. If God is all-good, why does everyone write in Word format?
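
    For the HTML case the extraction itself takes only a few lines. A minimal sketch with HTML::Parser (the list of skipped tags and the file name are illustrative, not from the original code):

        use strict;
        use warnings;
        use HTML::Parser;

        my %skip = map { $_ => 1 } qw(script style);   # tags whose contents we do not want
        my $skipping = 0;
        my $text     = '';

        my $p = HTML::Parser->new(
            api_version => 3,
            start_h => [ sub { $skipping++ if $skip{ $_[0] } }, 'tagname' ],
            end_h   => [ sub { $skipping-- if $skip{ $_[0] } && $skipping }, 'tagname' ],
            text_h  => [ sub { $text .= $_[0] unless $skipping }, 'dtext' ],
        );
        $p->parse_file('paper.html');   # hypothetical input file
        print $text;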

    While parsing, if the format is a sane one, you can extract all sorts of useful things from the document: for example, find its table of contents and exclude it from comparison (there is nothing useful in it anyway). The same goes for tables (short lines in table cells produce a lot of false positives). You can detect chapter headings, throw out pictures, and mark up Internet addresses. For web pages it makes sense to exclude sidebars and footers if they are marked up in the file (HTML5 allows this).

    Oh, and there may also be archives, which have to be unpacked and each file inside them processed. The main thing is not to mistake for an archive some complex packaged format like OOXML.

    Once we have plain text, we can work on it further. Throwing away the title page and the boilerplate that universities insist on ("Term paper by student so-and-so", "Checked by Professor Such-and-such") does nothing but good. While we are at it, we can deal with the list of references. Finding it is not so easy, since it goes by at least a dozen names ("References", "List of References", "Works Cited", "Bibliography" and so on), and it may not be labeled at all. It is best simply to throw it out of the text: it is very hard to match correctly, while creating a considerable load.

    The resulting text must be normalized, that is, tidied up and brought to a uniform form. The first step is to find all the Cyrillic and Greek letters that look like the corresponding Latin ones. Cunning authors deliberately sprinkle them through the text to fool the plagiarism check. No such luck: a trick like that is one hundred percent proof and grounds to throw such an author out on his ear.

    Then all the common contracted forms like can't are replaced with their full versions.

    Next, all the fancy typographic Unicode characters are converted to plain ones: guillemets and curly quotation marks, em and en dashes, apostrophes, ellipses, as well as the ligatures ff, ffi and the rest of that crowd. Two apostrophes in a row are replaced with a normal quotation mark (for some reason this happens very often), and two hyphens with one. Every run of whitespace characters (there is a whole zoo of those too) becomes a single ordinary space. After that, everything that does not fit into the ASCII range is thrown out of the text. And finally, all control characters except the ordinary line feed are removed.
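
    A minimal sketch of this normalization pass (the exact set of substitutions here is mine; the real list was longer):

        use strict;
        use warnings;

        # Cyrillic or Greek letters hiding among Latin text: solid evidence of cheating
        sub has_homoglyphs {
            my ($text) = @_;
            return $text =~ /[\x{0400}-\x{04FF}\x{0370}-\x{03FF}]/;
        }

        # Assumes $text has already been decoded into Perl characters
        sub normalize {
            my ($text) = @_;

            $text =~ s/[\x{201C}\x{201D}\x{00AB}\x{00BB}]/"/g;     # curly quotes, guillemets
            $text =~ s/[\x{2018}\x{2019}]/'/g;                     # curly apostrophes
            $text =~ s/[\x{2013}\x{2014}]/-/g;                     # en and em dashes
            $text =~ s/\x{2026}/.../g;                             # ellipsis
            $text =~ s/\x{FB00}/ff/g;                              # ligature ff
            $text =~ s/\x{FB03}/ffi/g;                             # ligature ffi

            $text =~ s/''/"/g;                # two apostrophes -> one quotation mark
            $text =~ s/--/-/g;                # double hyphen -> single
            $text =~ s/[^\S\n]+/ /g;          # collapse whitespace runs (newlines kept)
            $text =~ s/[^\x0A\x20-\x7E]//g;   # drop non-ASCII and control chars, keep line feeds
            return $text;
        }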

    Now the text is ready for comparison.

    Then we split the text into sentences. This is not as simple as it seems at first glance; in natural language processing everything looks easy only at first and only from the outside. A sentence may end with a period, an ellipsis, an exclamation mark or a question mark, or it may not end with anything at all (at the end of a paragraph).

    On top of that, periods appear after all sorts of abbreviations that are not sentence boundaries at all. The full list takes half a page: Dr., Mr., Mrs., Ms., Inc., vol., et al., pp. and so on and so forth. Then there are Internet links: it is fine when the protocol prefix is present, but it is not always there. An article may, say, discuss various online stores and constantly mention Amazon.com. So you also have to know all the domains, a dozen generic ones and a couple of hundred country codes.

    And even with all this you lose accuracy, because the whole process becomes probabilistic: any particular period may or may not be the end of a sentence.

    The first version of the sentence splitter was written head-on: regular expressions found all the "wrong" periods and replaced them with placeholder characters, the text was split into sentences on the remaining ones, and then the placeholders were turned back into periods.
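
    Roughly along these lines (an abbreviated sketch; the real abbreviation list ran to half a page):

        use strict;
        use warnings;

        my @abbrev = qw(Dr Mr Mrs Ms Inc vol pp);   # a small fragment of the real list
        my $HIDDEN = "\x01";                        # stand-in for a period that does not end a sentence

        sub split_sentences {
            my ($text) = @_;

            # Hide periods after known abbreviations
            for my $abbr (@abbrev) {
                $text =~ s/\b\Q$abbr\E\./$abbr$HIDDEN/g;
            }
            # Hide periods inside bare web addresses like Amazon.com
            $text =~ s/(\w)\.(com|org|net|edu|gov)\b/$1$HIDDEN$2/gi;

            # Split on the remaining terminators, then restore the hidden periods
            my @sentences = split /(?<=[.!?])\s+/, $text;
            s/$HIDDEN/./g for @sentences;
            return @sentences;
        }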

    Then I felt ashamed of not using the advanced methods developed by modern science, and I began to study the alternatives. I found a piece of code in Java and picked it apart over a couple of geological epochs (what a dull, monotonous, verbose language). I found Python's NLTK. But most of all I liked a paper by one Dan Gillick ("Improved Sentence Boundary Detection"), in which he boasted that his method utterly outperformed all others. It was based on Bayesian probabilities and required prior training. On the texts I trained it on it was excellent, but on everything else... well, not bad, yet not much better than the embarrassing version with the abbreviation list. In the end I went back to that one.

    Web search


    So, now we have the text, and we need to make Google work for us and hunt down its pieces scattered across the Internet. The regular search is off-limits, so what then? The Google API, of course. Simple. Its terms are far more liberal: a convenient, stable programmatic interface and no HTML parsing. The number of requests per day was nominally limited, but in practice Google did not enforce the limit, provided you did not push your luck by firing off millions of them.

    The next question is in what pieces to send the text. Google stores some information about the distance between words; empirically, runs of 8 words turned out to give the best results. The final algorithm was this (a sketch of the query-windowing step follows the list):

    • Break the text into words
    • Throw out the so-called stop words (the function words that occur most often: a, on and so on; I used the list that ships with MySQL)
    • Form eight-word queries with overlap (the first query is words 1-8, the second is words 2-9, and so on; you can even shrink the overlap to two words, which saves queries but slightly hurts quality)
    • If the text is large (>40 KB), every third query can be dropped, and if it is very large (>200 KB), even every second one. This hurts the search, but not by much, evidently because plagiarists usually paste in whole paragraphs rather than single phrases
    • Then send all the queries to Google, possibly all at once
    • Finally, collect the answers, parse them, merge them into one list and remove the duplicates. You can also sort the list of returned addresses by how many times each one came up and cut off the rare ones, treating them as unrepresentative and of little consequence. Unfortunately, here we run into the Zipf distribution, which stares out at you from every corner when you hunt for plagiarism: something like an exponential turned upside down, with a very long, dreary tail stretching off to infinity. The tail cannot be processed in full, yet it is unclear where to cut it; wherever you cut, quality suffers. The address list behaves the same way. So I cut it using an empirical formula based on the length of the text, which at least guaranteed a reasonably stable processing time as a function of the number of characters
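
    The windowing step might look something like this (the stop-word list is truncated; the window and step sizes are the ones mentioned above):

        use strict;
        use warnings;

        my %stop = map { $_ => 1 } qw(a an the on in of and or to is);   # fragment of the MySQL stop list

        sub make_queries {
            my ($text, $window, $step) = @_;
            $window //= 8;   # eight words per query
            $step   //= 1;   # shift by one word: words 1-8, then 2-9, ...; a bigger step saves queries

            my @words = grep { !$stop{ lc $_ } } split /\s+/, $text;
            my @queries;
            for (my $i = 0; $i + $window <= @words; $i += $step) {
                push @queries, join ' ', @words[ $i .. $i + $window - 1 ];
            }
            return @queries;
        }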


    This algorithm worked beautifully until Google wised up and shut the freebie down. The API remained, and even improved, but the company started charging for it, and rather steeply: $4 per 1000 requests. I had to look at the alternatives, of which there were exactly two, Bing and Yahoo. Bing was free, but that is where its virtues ended. It searched noticeably worse than Google; the latter may be the new Evil Corporation, but its search engine is still the best in the world. Bing, moreover, searched worse even than itself: through the API it found one and a half times fewer links than through the user interface. It also had the disgusting habit of failing a share of the requests with an error so that they had to be retried; apparently that is how Microsoft throttled the traffic. On top of that, the number of words per query had to be cut down to five, with the stop words left in.

    Yahoo sat somewhere between Google and Bing, both in price and in search quality.

    Along the way another little idea came up. The head of the department discovered a project that crawled the contents of the entire Internet every day and dumped it somewhere on Amazon. All we had to do was take the data from there, index it in our own full-text database, and then search it for what we needed. That is, effectively write our own Google, just without the spider. As you might imagine, it was completely unrealistic.

    Search in the local database


    One of Turnitin's strengths is its popularity. Masses of papers are submitted to it, students sending their own and teachers their students', so its search base keeps growing. As a result it can spot stolen material not only on the Internet but also in last year's coursework.

    We went the same way and built a local database of completed orders, plus the materials customers attached to their requests ("Here is the article the essay should be based on"). Writers, it turned out, love to recycle their own previous work.

    All this lived in a KinoSearch full-text index (the project has since been renamed Lucy), with the indexer running on a separate machine. KinoSearch acquitted itself well: even with hundreds of thousands of documents it searched quickly and accurately. The only drawback was that adding fields to the index or upgrading the library meant reindexing everything from scratch, which took several weeks.
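
    With today's Lucy the indexing side might look roughly like this (a sketch using the simplified Lucy::Simple wrapper; field names, paths and queries are illustrative, not taken from the original system):

        use strict;
        use warnings;
        use Lucy::Simple;

        my $index = Lucy::Simple->new(
            path     => '/srv/plagiarism/index',   # hypothetical location
            language => 'en',
        );

        # Indexing: one document per completed order
        my ($order_id, $normalized_text) = ('order-12345', '... normalized text of the paper ...');
        $index->add_doc({ id => $order_id, content => $normalized_text });

        # Searching: returns the number of hits and lets us walk through them
        my $hit_count = $index->search( query => 'normalized text', num_wanted => 20 );
        while ( my $hit = $index->next ) {
            print "$hit->{id}\n";
        }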

    Comparison


    Now for the juiciest part, the one without which nothing else matters. Two checks are needed. First, compare two texts and decide whether one contains pieces of the other; if it does not, we can stop right there and save computing power. If it does, a more complex and heavier algorithm kicks in to look for similar sentences.

    Initially, documents were compared with a shingling algorithm: overlapping chunks of the normalized text, each reduced to a checksum that is then used for comparison. The algorithm was implemented and even worked in the first version, but it turned out worse than comparison in a vector space. The idea of shingles did, however, unexpectedly come in handy for the web search, as I mentioned above.

    So, we compute a certain similarity coefficient between documents. The algorithm is the same one search engines use. I will describe it in plain, homespun terms; the scientific treatment can be found in the well-known book (Christopher Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008). I hope I do not muddle anything, though it is quite possible: this is the most difficult part of the system, and it kept changing on top of that.

    We take all the words from both texts, reduce each to its stem, throw out the duplicates and build a giant matrix. Its columns are the stems, and it has only two rows: the first text and the second text. At each intersection we put a number: how many times that word occurs in that text.

    The model is quite simple and is called the "bag of words", because it ignores the order of words in the text. But for us that is exactly right, since plagiarists very often shuffle the words around when rephrasing what they have copied.

    Extracting the stem of a word is called stemming in linguistic jargon. I did it with the Snowball library: fast and trouble-free. Stemming improves plagiarism detection because cunning authors do not just copy someone else's text, they touch it up cosmetically, often turning one part of speech into another.
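
    On CPAN the Snowball stemmers are available as Lingua::Stem::Snowball; a usage sketch (not the original code):

        use strict;
        use warnings;
        use Lingua::Stem::Snowball;

        my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

        my @words = qw(connection connected connecting connects);
        my @stems = $stemmer->stem( \@words );   # all four reduce to "connect"
        print "@stems\n";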

    So we have a matrix of stems, which describes a huge multidimensional space. Now we treat our texts as two vectors in that space and compute the cosine of the angle between them (the dot product divided by the product of the vector lengths). That is our measure of the similarity between the texts.

    Simple, elegant and in most cases correct. It only performs poorly when one text is much larger than the other.

    Experiments showed that texts with a similarity coefficient below 0.4 need not be considered at all. Later, though, after the support team complained about a couple of undetected sentences, the threshold had to be lowered to 0.2, which made it rather useless (the damned Zipf again).

    A few words about the implementation. Since the same source text has to be compared over and over, it makes sense to prepare its list of stems and their counts in advance; that way a quarter of the matrix is ready from the start.

    For multiplying the vectors I first used PDL (what else?), but then, chasing speed, I noticed that the vectors were terribly sparse and wrote my own implementation on top of plain Perl hashes.
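
    A minimal sketch of such a hash-based cosine similarity (my reconstruction, not the original code):

        use strict;
        use warnings;

        # A text is a sparse vector: stem => number of occurrences
        sub bag_of_words {
            my ($stems) = @_;        # reference to a list of stems
            my %v;
            $v{$_}++ for @$stems;
            return \%v;
        }

        sub cosine {
            my ($x, $y) = @_;
            ($x, $y) = ($y, $x) if keys %$x > keys %$y;   # iterate over the smaller vector
            my ($dot, $nx, $ny) = (0, 0, 0);
            $dot += $x->{$_} * ($y->{$_} // 0) for keys %$x;
            $nx  += $_ ** 2 for values %$x;
            $ny  += $_ ** 2 for values %$y;
            return 0 unless $nx && $ny;
            return $dot / ( sqrt($nx) * sqrt($ny) );
        }

        # Usage: similarity between two stemmed texts
        my $v1 = bag_of_words([qw(student copy paper from internet)]);
        my $v2 = bag_of_words([qw(student paste paper from internet)]);
        printf "%.2f\n", cosine($v1, $v2);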

    Next we need a similarity coefficient between sentences. There are two options here, and both are variations on the same vector-space theme.

    You can do it quite simply: take the words of the two sentences, build a vector space out of them and compute the angle. There is no point even trying to account for how many times each word occurs; words rarely repeat within a single sentence anyway.

    But you can be cleverer and apply the classic tf-idf scheme from the book, except that instead of a collection of documents we have the collection of sentences from both texts, and the role of a document is played by a sentence. We take the shared vector space of the two texts (already built when we computed the document-level similarity), construct the sentence vectors, and replace the raw occurrence counts with the weight ln(number of sentences / number of sentences containing the word). The result is better: not radically, but noticeably.
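
    A sketch of that weighting, reusing the sentence vectors (stem => count hashes) from the cosine sketch above; the exact formula is my reading of the tf-idf description:

        use strict;
        use warnings;

        # @sentences: one hash ref per sentence, mapping stem => occurrence count
        sub idf_weights {
            my (@sentences) = @_;
            my %df;                           # in how many sentences each stem occurs
            for my $s (@sentences) {
                $df{$_}++ for keys %$s;
            }
            my $n = @sentences;
            return { map { $_ => log( $n / $df{$_} ) } keys %df };
        }

        # Replace the raw counts with idf weights before taking the cosine
        sub weighted {
            my ($sentence, $idf) = @_;
            return { map { $_ => $idf->{$_} } keys %$sentence };
        }

        # my $idf = idf_weights(@all_sentence_vectors);
        # my $sim = cosine( weighted($s1, $idf), weighted($s2, $idf) );   # cosine() from the sketch above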

    If the similarity between two sentences exceeds a certain threshold, we record the pair in the database, so that later the evidence can be shoved under the plagiarists' noses.

    One more thing: if a sentence contains only one word, we do not even bother comparing it with anything; it is pointless, the algorithm simply does not work on scraps like that.

    If the similarity coefficient is above 0.6, you do not need a fortune teller: it is a paraphrased copy. Below 0.4, the similarity is accidental or absent altogether. In between lies a gray zone: it might be plagiarism, or it might be a mere coincidence with nothing actually shared between the texts.

    For that zone another algorithm comes into play, one I took from a good article (Yuhua Li, Zuhair Bandar, David McLean and James O'Shea. "A Method for Measuring Sentence Similarity and its Application to Conversational Agents"). Here the heavy artillery, linguistic features, rolls out. The algorithm needs to know irregular inflected forms, relations between words such as synonymy and hypernymy, and how rare each word is. All of that requires the relevant information in machine-readable form. Fortunately, the good people at Princeton University have long maintained a lexical database of English called WordNet, and there is a ready-made module on CPAN for reading it. The only thing I did was move the data from the text files in which Princeton distributes it into MySQL tables, and rewrite the module accordingly. Reading from a heap of text files is neither convenient nor fast, and storing links as byte offsets into a file can hardly be called elegant.
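
    Before such a rewrite, the lookups would go straight to the Princeton files; a sketch with WordNet::QueryData, one such CPAN module (the word senses are just examples):

        use strict;
        use warnings;
        use WordNet::QueryData;

        my $wn = WordNet::QueryData->new;   # expects the WordNet data files to be installed

        # Synonyms: other words in the same synset as sense 1 of the noun "car"
        my @synonyms  = $wn->querySense('car#n#1', 'syns');

        # Hypernyms: more general concepts, "motor_vehicle" and so on upward
        my @hypernyms = $wn->querySense('car#n#1', 'hype');

        print "syn: @synonyms\nhyper: @hypernyms\n";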

    Second version


    Hmm... the second. And where is the first? Well, there is nothing to tell about the first. It took the text and ran all the steps of the algorithm one after another: normalized, searched, compared and returned the result. Accordingly, it could not do anything in parallel, and it was slow.

    So all the work after the first version was aimed at one thing: faster, faster, faster.

    Since most of the time went into obtaining links and pulling data from the Internet, network access was the first candidate for optimization. Serial fetching was replaced with parallel downloads (LWP gave way to asynchronous curl). The speed, of course, grew fantastically. Not even the module's glitches could spoil the joy, as when it was given 100 requests, completed 99 and hung forever on the last one.
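
    The shape of such parallel fetching, sketched here with AnyEvent::HTTP rather than the curl bindings mentioned above (URLs and the timeout are illustrative):

        use strict;
        use warnings;
        use AnyEvent;
        use AnyEvent::HTTP;

        my @urls = ('http://example.com/a', 'http://example.com/b');   # illustrative
        my %body;

        my $cv = AE::cv;                  # condition variable: done when every request finishes
        for my $url (@urls) {
            $cv->begin;
            http_get $url, timeout => 30, sub {
                my ($data, $headers) = @_;
                $body{$url} = $data if $headers->{Status} =~ /^2/;   # keep successful responses
                $cv->end;
            };
        }
        $cv->recv;                        # block until all callbacks have fired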

    The overall architecture of the new system was modeled on an operating system. A control module launches child processes and allots each one a "quantum" of time (5 minutes). Within that time a process is supposed to read from the database where it left off last time, perform the next step, write its progress back to the database and exit. Five minutes is enough for any operation except downloading and comparing links, so that step was cut into chunks of 100 or 200 links at a time. When the five minutes are up, the dispatcher interrupts the process no matter what. Did not finish? You will try again next time.

    The worker process, however, must also watch the timer itself, because there is always the risk of hitting some website that hangs everything (one such site, for example, consisted of a list of 100,000 English words and nothing else; it is easy to imagine the algorithms described above chewing on that for three days in search of similarities, and perhaps even finding some).
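
    A sketch of such a self-guarding worker step, using an alarm as the internal timer (the quantum length and the step itself are placeholders):

        use strict;
        use warnings;

        my $QUANTUM = 5 * 60;    # seconds; the dispatcher enforces the same limit from outside

        sub run_step {
            my ($step) = @_;     # $step is a code ref: one resumable unit of work
            my $ok = eval {
                local $SIG{ALRM} = sub { die "quantum expired\n" };
                alarm $QUANTUM;
                $step->();       # e.g. download and compare the next 100 links
                alarm 0;
                1;
            };
            alarm 0;             # make sure the alarm is cleared even after a die
            warn "step interrupted, will resume next time: $@" unless $ok;
            return $ok;
        }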

    The number of worker processes could be changed, in theory even dynamically. In practice three processes turned out to be optimal.

    Naturally, there was also a MySQL database holding the texts awaiting processing, the intermediate data and the final results, plus a web interface where users could see what was being processed at that moment and what stage it had reached.

    Tasks were prioritized so that the more important ones finished sooner. Priority was computed as some function of file size (the bigger the file, the slower it is processed) and deadline (the nearer it is, the sooner the results are needed). The dispatcher picked the next task by highest priority, but with a random correction; otherwise the low-priority tasks would never get a turn at all as long as higher-priority ones kept coming.
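
    The article does not give the actual formula, so the following is purely an illustrative sketch of the idea: score by size and deadline, add random jitter, pick the best:

        use strict;
        use warnings;

        # Hypothetical priority: smaller files and nearer deadlines score higher
        sub priority {
            my ($task, $now) = @_;
            my $hours_left = ( $task->{deadline} - $now ) / 3600;
            $hours_left = 1 if $hours_left < 1;
            return 1 / ( $task->{size_kb} * $hours_left );
        }

        sub pick_next {
            my ($now, @tasks) = @_;
            # Jitter each score once so that low-priority tasks are not starved forever
            my @scored = map { [ $_, priority($_, $now) * (0.5 + rand) ] } @tasks;
            my ($best) = sort { $b->[1] <=> $a->[1] } @scored;
            return $best->[0];
        }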

    Third version


    The third version was an evolution in its processing algorithms and a revolution in its architecture. I remember standing in the cold before a date that never happened, waiting for Godot, and recalling an article I had recently read about Amazon's services: they store files, they spin up virtual machines, they have all sorts of obscure three-letter services. That is when it dawned on me. I remembered a giant shrimp I had once seen in the Sevastopol aquarium. It sits among the rocks, waves its legs and filters the water; the water carries all sorts of tasty morsels to it, the shrimp picks them out and spits the water onward. Put a row of such shrimp together and they will filter everything in no time. And if the crustaceans come in different kinds, each catching its own food, then what prospects open up.

    Translating from the figurative to the technical: Amazon has a queue service, SQS, essentially endless conveyor belts carrying data. We write several programs, each performing just one action, with no context switching, child processes or other overhead. "From morning till night the Tap fills the same buckets with water; the Gas Stove heats the same pots, kettles and pans."

    The implementation turned out simple and beautiful. Each step of the algorithm described above is a separate program, and each has its own queue. The queues carry XML messages saying what to do and how. There is also a control queue and a separate dispatcher program that keeps track of orders, updates the progress information and notifies the user of any problems. The individual programs can send their results back to the dispatcher or straight into the next queue, whichever is convenient. If an error occurs, they send the dispatcher a message about it, and it sorts things out.
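
    The skeleton of such a worker, sketched with the Amazon::SQS::Simple module (queue names and credentials are placeholders; the real messages were XML, whose parsing is omitted here):

        use strict;
        use warnings;
        use Amazon::SQS::Simple;

        my $sqs = Amazon::SQS::Simple->new($ENV{AWS_ACCESS_KEY}, $ENV{AWS_SECRET_KEY});
        my $in  = $sqs->CreateQueue('normalize-in');    # returns the queue if it already exists
        my $out = $sqs->CreateQueue('search-in');

        while (1) {
            my $msg = $in->ReceiveMessage();
            unless ($msg) { sleep 5; next }             # queue empty: wait and poll again

            my $body   = $msg->MessageBody();           # XML describing the task
            my $result = do_one_step($body);            # this worker's single action

            $out->SendMessage($result);                 # pass the work down the pipeline
            $in->DeleteMessage($msg->ReceiptHandle());  # only now is the message really gone
        }

        sub do_one_step { my ($xml) = @_; return $xml } # placeholder for the actual step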

    Error recovery comes for free. If a program fails to finish a task and, say, crashes, it is simply restarted, while the unfinished task stays in the queue and pops up again after a while. Nothing is lost.

    The only subtlety with Amazon is that the queue service guarantees each message is delivered at least once: it will certainly arrive, but not necessarily only once. You have to be ready for this and write the processes so that they handle duplicates properly: either skip them (not very convenient, since it means keeping some kind of bookkeeping) or process them idempotently.
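
    The bookkeeping variant can be as small as one unique index; a sketch with DBI and MySQL (table and column names are mine):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:plagiarism', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 1 });

        # seen_messages has a UNIQUE key on message_id, so inserting a duplicate affects 0 rows
        sub first_time_seen {
            my ($message_id) = @_;
            my $rows = $dbh->do('INSERT IGNORE INTO seen_messages (message_id) VALUES (?)',
                                undef, $message_id);
            return $rows == 1;     # 1 row inserted: new message; zero rows: a duplicate
        }

        # process the task only on the first delivery of the message, e.g.:
        # process($body) if first_time_seen($msg->MessageId());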

    Files downloaded from the Internet were of course not sent inside messages: that would be awkward, and SQS has a size limit anyway. Instead they were stashed on S3, and only a link went into the message. Once a task was finished, the dispatcher cleaned up all this temporary storage.

    Intermediate data (for example, how many links have to be fetched and how many are already done) was kept in Amazon SimpleDB, a simple but distributed database. SimpleDB has limitations of its own worth keeping in mind; for instance, it does not guarantee that updates become visible immediately.

    Finally, the end results, the plagiarized passages, I started writing not to MySQL but to CouchDB. They had been stored non-relationally in the relational database anyway, as text fields in Data::Dumper format (the Perl analogue of JSON). CouchDB was lovely in every way, like the Queen of Sheba, but it had one flaw, and a fatal one: you cannot hit it with an arbitrary query. For every query an index has to be built in advance, which means the queries have to be predicted in advance. If your crystal ball is out of order, you have to kick off the indexing process, which on a large database runs for several hours (!) while all other queries sit and wait. Nowadays I would use MongoDB, which has background indexing.

    The resulting scheme had one huge advantage over the old one: it scaled naturally. There is no local data, everything is distributed (except the results database), and all instances of a worker are completely identical. They can be grouped by weight: run all the lightweight, low-resource workers on one machine and give the sluggish text-comparison worker a virtual server of its own. Not enough? Not keeping up? Add another one. Some process still cannot cope? Move it out to a separate machine. In principle this could even be done automatically: notice that one of the queues has piled up too many unprocessed messages and spin up another EC2 server.

    However, harsh Auntie Life, as usual, made her adjustments to this idyll. Technically the architecture was excellent, but economically it turned out that using SimpleDB (and S3) was a dead loss: very expensive, the database especially.

    I had to hastily move the intermediate data back to good old MySQL and store the downloaded documents on a disk shared over NFS, and with that, forget about seamless scaling.

    Unrealized plans


    Studying natural language processing, in particular from Manning's exhaustive book, I could not shake the feeling that all the methods described there are ad hoc tricks for specific tasks, not general approaches at all. Back in 2001 Lem took a swipe at computer science, which in forty years had failed to produce artificial intelligence despite all the fuss on the subject, and gloomily predicted that the situation would not change in the foreseeable future: the machine did not understand meaning and was not going to. The philosopher was right.

    The plagiarism search was exactly that kind of trick. I did not, of course, hope to conjure up an AI and wait for human-level comprehension of the text, I was not that naive, but for a long time I toyed with the idea of adding syntactic parsing, at least enough to recognize identical sentences that differ only in voice (active versus passive). However, every natural language parser I found was extremely complex, probabilistic, produced results in an impenetrable format and demanded enormous computing resources. On the whole I think that at the current state of the science this is unrealistic.

    Human factor


    The system was written to run fully automatically, so people had little chance to meddle with it. Besides, a very good system administrator worked alongside me, thanks to whom all the servers were configured impeccably and downtime of every kind was kept to a minimum. But there were still users: the support team and, of course, the management.

    Both were long convinced that plagiarism was being hunted not by the computer but by a little man (or even a whole crowd of them) sitting inside the computer. He is almost like a real person: in particular, he perfectly understands everything written in term papers on any subject, and he finds plagiarism because he keeps the entire contents of the Internet in his head. Yet when these little men messed up, the blame, against all logic, somehow landed not on them but on me. Philologists, in a word.

    It took a great deal of effort to explain that the plagiarism is, after all, being hunted by a computer that has no idea what it is doing. After about a year this sank in with the management; with the rest, it seems, not entirely.

    The support team also had another habit: typing a few sentences into Google and gleefully informing me that Google had found plagiarism while my system had not. Well, what could I say to that? Explain the Zipf distribution? Explain that compromises had to be made for the sake of speed and memory, and that every such compromise costs quality? Hopeless. Fortunately, in most of those cases it turned out that Google had found the material on some paywalled site the system simply had no access to.

    There was also the trick of reporting that Turnitin had detected plagiarism while our system had not. Here it was impossible to explain that Turnitin was most likely written by an entire team of qualified specialists with degrees in the relevant field, and that the site itself is on intimate terms with some serious search engine. Again, fortunately, most of the missed cases of plagiarism came from paywalled sites or from other students' papers, that is, from sources we could not reach at all.

    For several months I tried to meet the director's demand for a fixed processing time: no paper should take more than an hour to check. I could not manage it and lost sleep over it, until one day it dawned on me that what they essentially wanted me to invent was a perpetual motion machine, one whose capacity grows along with the load. No such thing exists in life, nor in the world of software. Once the requirement was reformulated, every paper up to a certain size (50 pages) must be checked within an hour, provided no enormous dissertations are sitting in the queue at the time, things went smoothly. The conditions were harsh, but at least feasible in principle.

    From time to time the support team really shone. I am at a loss to explain their logic, but periodically, when the checking queue was under heavy load, they... stuffed extra copies of the same papers into it. That is, if a hundred cars are stuck in a traffic jam, send another hundred onto the road and things will start moving. I never managed to explain the error to them, and such cases had to be stamped out by purely administrative means.

    Parting words to commentators


    My sad experience shows that Habr is home to a certain number of youngsters who for some reason believe that from birth they have been perfectly versed in every branch of knowledge mankind has invented. Straight out of Chekhov: "she's all about emancipation, everyone's a fool and she alone is clever." If you count yourself among these comrades and decide to write that I am an idiot, understand nothing and cannot grasp simple things, and so on, then please keep in mind that the system I built was worked hard for two years, 24 hours a day, with almost no downtime, and saved the customer several sacks of money. So when composing a comment of the kind described above, please state right away the corresponding figures for the system you have built, so that your genius is obvious at once, without any prompting.
