Poetic discourse with a touch of reverse engineering

    "The old Assembler noticed us,
    And going down into the coffin, blessed"


    image

    Once I decided to write a program that writes poetry. The algorithm came up quickly - put rhyming words at the end of the stanzas compiled, and fill in the rest of the stanza with words taking into account rhyme, rhythm, and the probability of their finding next to other words taken from ready-made connected texts. A kind of Markov chain with rhymes screwed to them.


    Before implementing the algorithm, I decided to look at what has already been created by others. The first in Yandex search was (who would doubt it!) Yandex.Autopoet , using neural networks trained on the verses of classics. The second item was the program “Assistant Poet”, which upon closer examination turned out to be a regular dictionary of rhymes. But in third place was the site of the famous writer and seasoned fidoshnik Lleo aka Leonid Kaganov.


    Why did he end up there? Because when he was a student at the Mining Institute, Lleo wrote a poetry program as a thesis. I don’t know how poetic the defense of such a diploma was, but the program, apparently, worked well - the poems written by her were posted on the author’s site. The program itself was also found there, it worked under MS-DOS and a 32-bit DOS / 4GW extender. The source code for this version was also uploaded. From the explanatory note to the diploma, I learned that there was also a version for OS / 2, apparently even with a graphical interface, but its source code was not found. But the MS-DOS version could be run under DOSBox and see it in action: it really produced rhymed and rather coherent verses, although not very meaningful. For 1996, when Lleo wrote this program, this level of auto-generated poetry was very cool. In my opinion, they are not even worse than the verses of Yandex.Autopopoet. Or maybe Lleo became a famous writer with the help of a modified version of his program ?! (Scandals, intrigues, investigations! Just kidding, of course, but who knows ...).


    I began to study how this program works. To compose verses, she needed 2 files — a word base and markup of rhymes and the size of the verses she composed. Sources were in Assembler, about 3,500 lines of source code under TASM. The author of the program wrote about this: “I chose the solution for most of the tasks in Assembler. It was on it that I performed all the educational work that allowed me to choose a programming language. Mainly because I can write and debug a program in this language faster and easier - it allows me to interact more flexibly with the machine. ” And here I completely agree - Assembler is very flexible, and does not impose any programming paradigm. Although, of course, it’s faster to write programs in modern languages, collecting them from ready-made dice libraries. The source codes contained all the characteristic signs of the Assembler programs of that time - short, not always intelligible, names of variables and functions that were clear only to the author; fixed sizes of used arrays with the comment “probably enough”; a bunch of global variables, and in some places witty author's comments and error messages, such as “Creative Crisis !!!” at the moment when the program runs out of words for selecting rhymes. In this spirit:


    @@punkt13:
        ;call io
        ;db 13,10,'{{F_LEVEL}}=',0
        ;movzx eax,[F_LEVEL]
        ;call pr_dec
        cmp [nomer_LEVEL],0 ;13) Если слово не оконечное ■ к 16
        jne @@punkt16
        ;call io
        ;db 13,10,'ОШИБКА ПОИСКА РИФМЫ',0
        ;call key
        cmp [F_LEVEL],0 ;13.0) если ■сфера поиска■ = вся база
        jne @@punkt14
        cmp [FREE_RHYME],0  ;13.1) Если и рифма была свободная, то
        je @@error_twor     ;■ТВОРЧЕСКИЙ КРИЗИС■, конец
        stc
        ret ;вернуться с неудачей
    @@punkt14:
        cmp [F_LEVEL],1 ;14) ;Если ■сфера поиска■ = ■заданная тематика■, то
        je @@565656
        ;call io
        ;db 13,10,'поиск шел в АССОЦИАЦИЯХ, теперь будет в теме',0

    That's where it all started - carried away by studying the source, I forgot that I was originally going to implement the algorithm from scratch, and, remembering my own assembler programs, I decided to port the Lleo algorithm to a modern programming language. Moreover, this algorithm was very similar to the one that I had in mind. I chose Python as the porting language - it is very convenient to work with text on it.


    The analysis of the program began with the fact that I walked through the code and removed all the commented code, there were a lot of it. He left only the commented out debugging output - it helped to understand what was happening in this place of the program. Next, I deleted all purely utility calls, like getting command line switches and file I / O. Now, when there was only the code regarding the algorithm, I began to understand it and port it to Python. For those functions whose purpose was clear, I immediately replaced the names with clear ones or, rewriting in Python, deleted the assembler code. The rest - I began to translate line by line into Python. Of course, line by line, of course, it was said strongly - the program used global state variables widely, and to prevent this from happening in the Python code, in many places the algorithm had to be completely rewritten, without regard to the lines of the original program. At this point, the code looked like this - not Python yet, but no longer Assembler:


    randomValue = init random(777)
    if curMode=='C':            
        print'НАПИСАНИЕ СТИХОТВОРЕНИЯ',13,10,' база: ',0
        BASEname    
        NAME_SHABLON    
        call loadBASE   
        call CREATE
    else if curMode=='U':           
        print'ПРОДОЛЖИТЬ РАССТАНОВКУ УДАРЕНИЙ',13,10,' база: ',0        
        call loadBASE       
        call stat       
        call setUdarenie_N
        call saveBASE
        #        call automat - автом расст удар
        #   jmp @@udara1        
        return        

    Unfortunately, I did not immediately guess to look into the text of the explanatory note for the diploma, being sure that there was written a standard crap about the economic justification. Because of this, the verse algorithm itself and the format of the database of words, I literally reverse-engineering - by the intelligibility of the names of variables and functions, the source code of the program did not go very far from listing the disassembler. Although in general, the algorithm of verses and the format of the verbal base are described in the diploma note. I repent and sprinkle ash on my head. It is good that their analysis took no more than a couple of evenings. In addition, not all the details of the format and algorithm appeared in the documentation, so that later reverse engineering continued. To work with binary data, Python found a very convenient unpack function. And, a little tinkering with the order of bytes of data in the data (of course, here it is little endian, since the program is written for the Intel processor), I was able to load the word base. The file format with the rhythm of the verse was textual, and very simple; you did not need to parse the code to download it.


    Now it was necessary to understand the actual algorithm for writing poetry. As I wrote above, in general, it was described in an explanatory note, but some details were not there. For example, the fact that the verse template is set in the reverse order - from the end of the stanza to the beginning. As well as the fact that the Latin letter 'p' is everywhere replaced with the Russian 'p' - the FIDO heritage, where there was a glitch with the Russian “p”, and it was everywhere replaced with the Latin one, so in the Russian-language texts downloaded from FIDO “p” ”Was everywhere Latin, and it should have been converted back to Russian. Well and other similar trifles. In general, the algorithm was similar to the Markov chains with rhymes described at the beginning of the article, but it was distinguished by the fact that it used the stack to save the state while writing the stanza, with the ability to roll back the states if the algorithm reached a dead end, not finding words with the right stress and the number of syllables. The code also showed an attempt to compose poems on a given topic, for which the initial word of this topic was chosen, and then the search went on the words associated with it. But it seems that this feature did not work, and in functionmake_RND_FIELD_TEMA there is only 1 hardcoded index of the word left, from which the program begins the selection of words.


    In the process of parsing the program there were funny moments.
    For example, at the beginning of the program there was such a fragment:


    jmp @@skip      ; Это...
    db  'WATCOM' ; И это нужно для того, чтобы работало под DOS4GW
    @@skip:

    The fact is that the 32-bit DOS / 4GW extender was written for programs compiled by Watcom commercial compilers, and was itself a commercial product. And the fact that the program was compiled by the Watcom compiler was determined by the line "Watcom" at the beginning of the program code. If this line was not present, then DOS / 4GW refused to work. In fairness, I note that advanced people at that time used the PMODE / W by Tran expander, where there is no such nonsense, which is noticeably smaller in size, is free, and can be assigned to the program, while DOS / 4GW usually lies in the form separate executable file.


    There was another piece of code:


    proc bswap_eax ;я же не виноват, блин, что у 386 процессора нет bswap!
    mov [bswap_mes],eax ;приходится извращаться...

    Indeed, there was no bswap command on the 80386 processor, it appeared starting from 80486 and turned out to be very convenient, for example, for converting byte order little endian -> big endian. So the people wrote such functions and comments.


    Another funny thing happened when I tested the algorithm for writing a verse. For the test, I asked the end of the stanza so that the word “busy” was definitely inserted into the rhyme, which I definitely saw in the words database file. However, for some reason this rhyme was not found. It turned out that the word “busy” is written in the database, and “o” at the end is part of the service data - a pointer to the word associated with it. That it was the letter “o” is just a coincidence.


    When the writing of poems worked, I quickly cut down the writing of prose - the usual Markov chains, and I wanted more - so that my program could itself generate a word base from the text, and not just use the ready-made from the original program. Almost the majority of the program turned out to be devoted to the generation of the base, and due to the well-thought-out algorithm of work, this part impressed me more than the one that composes poetry. In fact, she does all the preparatory work for composing verses: she can parse words from the input text, break them into syllables, and even automatically arrange stresses based on previously collected statistics on manual arrangement of stresses on syllables. Although, the emphasis is far from always correct. And, as far as I know, in the Russian language there is no any stable rule for placing stresses. The stress base was stored in a separate file $$$$ SLOG.BSY, the format of which was not described in the diploma note. Here again I had to do a little reverse engineering.


    When the generation of the word database started working, it was already possible to start experimenting with various texts. As a result of the experiments, it turned out that the algorithm for breaking words into syllables taken from the program does not always work correctly, and I rewrote it from scratch. This also allowed us to refine the algorithm for obtaining a rhyming ending - now it works with syllables and accents, and does not just go to the vowel in order, relying on this attribute to find the right syllable.


    After that, I packed all the functionality into objects, decomposed it into modules, and quickly wrote down a Python script using these modules, working from the command line, starting with the same keys, and able to do the same as the original program. He also knows how to load databases from the original program, although he saves them in his own format - serialization of data through Python pickle. Although the algorithm of the original program is pretty shabby, in many places I left the original comments - it is interesting to read them, and they keep the spirit of that era. In addition, when launched with the –oldschool switch, the script displays in the help console the original program, where there are a bunch of greetings to different people and ships.


    Here is an example of composing verses:


    ***
     мАркеров И смОтрит нА
     конурУ И именА
     втОрник Я сначАла Ехал
     консультАция должнА
     пАмять хИтрость И даЕт
     вскАкиваю И поЕт
     нАдо А потОм Я вЫшел
     головОй И продаЕт
    

    ***
     какИми словАми сейчАс нЕ смущАет
     вокрУг стУк однАжды агА иногдА
     листОчков врАг нОмер одИн предлагАет
     прогрАмма нЕ пЕрвый трамвАй шЕл кудА
     покАзывает чтО нибУдь сО словАми
     егО зА компьЮтер забЫла включИть
     старАясь нЕ врЕмя семЕстра А сАми
     потОм Я самА проезднОй нЕ учИть

    As a result, we have a program composing graphomanic verses, with the help of which it is quite interesting to experiment with various texts.


    What else can be done to it good:


    • when generating a database of words, compile a rhyme dictionary - this will speed up the selection of words
    • teach the program to use stress dictionaries from the Internet, which will allow more correctly to automatically place stresses (but still not always correctly - in Russian there are words with the same spelling, but different stress)
    • teach the program to compose in those foreign languages ​​where the spelling uniquely determines the sound of the word. With English, this will be problematic - there, as they say, “We are writing Liverpool, we read Manchester”, but with French it’s quite real. In addition, it is easier to place stresses there - they (almost) always fall at the end of a word.

    That's actually all that I wanted to talk about in this article.
    I posted the fruits of my labors on Github .


    Thank you all for your attention!


    Also popular now: