
Text Analyzer: Authorship Recognition (Start)

Good afternoon, dear Habr readers. I have long wanted to publish my “Text Analyzer” under the GPL license ( [1] ), and I finally got around to it. “Text Analyzer” is a research project that I developed over three years, in my 3rd, 4th, and 5th years at university. The main goal was to create an algorithm for recognizing the authorship of a text using Hamming or Hopfield neural networks. The idea was this: these networks recognize patterns, and the task of identifying authorship can be reduced to a pattern-recognition task. To do that, statistics must be collected for each text, and the more different criteria the better: frequency analysis of letters, analysis of word / sentence / paragraph lengths, frequency analysis of two-letter combinations, and so on. The neural network could then reveal which texts are closest in their characteristics. In addition to the main task, I also implemented another piece of know-how: the “Map of Harmony”. As planned, such a map should show all the bad- and good-sounding places in a text, highlighting them with color. The criteria for evaluating harmony had to be given in some universal way, for example by rules; for this purpose I even developed a special graphic language, RRL (Resounding Rules Language). There was a mountain of work: lots of code, tricky algorithms, OOP, design patterns. The result was a large and complex program, albeit with an unsightly interface. With this project I won a thesis competition, took 1st and 3rd places at university conferences, and 2nd place at an international scientific and practical conference.
More than two years have passed since then, and I hardly remember how it all works. Let's try together to figure out what's under the hood.
(The article has a continuation and a conclusion.)
The structure of the article:
- Authorship analysis
- Getting to know the code
- TAuthoringAnalyser internals and text storage
- Leveling by state machine on strategies
- Frequency response collection
- Hamming Neural Network and Authorship Analysis
Additional materials:
- Sources of the "Text Analyzer" project (Borland C++ Builder 6.0)
- Testing the Hamming neural system in Excel ( [xls] )
- Transition table of the state machine that breaks text into levels ( [xls] )
- Calculation of the harmony of individual letters ( [xls] )
- Presentation of the text analyzer diploma project ( [ppt] )
- Presentation of the “Map of Harmony” project ( [ppt] )
- All of these materials are compressed ( [zip] , [7z] , [rar] )
1. Authorship Analysis
What needs to be done:
- load the sample texts and the “key” text (whose authorship is unknown);
- make the sample texts comparable:
- break the texts into words, sentences, and paragraphs,
- form blocks of words, sentences, and paragraphs of the same length for each text,
- select only comparable blocks of the same size at the “word”, “sentence”, and “paragraph” levels;
- collect statistics on these three levels;
- load the data into the Hamming neural network (a rough sketch of the required encoding follows this list);
- use it to recognize the pattern;
- identify, at all three levels, the sample texts whose characteristics are closest to the key text; the key text most likely belongs to one of their authors.
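A quick note on the last steps: a Hamming network compares bipolar patterns, so every collected statistic eventually has to be reduced to a vector of +1 / -1 values. As a rough illustration only (this is not the project's code, and all the names here are made up), the encoding could look like this:

#include <cstddef>
#include <vector>

// Sketch: turn a vector of statistics into a bipolar pattern for a Hamming
// network by comparing each value against a threshold. Hypothetical names.
std::vector<int> ToBipolarPattern(const std::vector<double>& stats,
                                  const std::vector<double>& thresholds)
{
    std::vector<int> pattern;
    pattern.reserve(stats.size());
    for (std::size_t i = 0; i < stats.size(); ++i)
        pattern.push_back(stats[i] >= thresholds[i] ? +1 : -1); // bipolar feature
    return pattern;
}

The conversion actually used is a topic for the continuation; the point here is only that the network sees patterns, not raw statistics.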
Here are some conclusions from the plan:
- There will be work with large amounts of data, up to megabytes - depending on the texts.
- Texts can be of very different sizes: from short stories to multi-volume novels. Therefore, it is necessary to somehow ensure the basic comparability of different works.
- Point 2 is needed to increase recognition accuracy, to differentiate the process by levels and to provide basic comparability of texts.
- Parsing text into words, sentences, and paragraphs is a task for a state machine (a small sketch follows this list).
- There can be many different statistics; I would like them to be collected in a universal way, so that other collection algorithms can always be added.
- The Hamming network works only with data in a specific form, which means the collected data must be converted into a form it understands.
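Here is the sketch promised above: a minimal state machine that extracts only words, just to show the shape of the approach. It is not the project's code (the real machine also handles the sentence and paragraph levels and is discussed in the continuation); the names are hypothetical.

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the state-machine idea at the "word" level: the current state plus
// the class of the current character decide when a word begins and ends.
enum TSplitState { stOutside, stInWord };

std::vector<std::string> SplitIntoWords(const std::string& text)
{
    std::vector<std::string> words;
    std::string current;
    TSplitState state = stOutside;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const bool isLetter = std::isalpha(static_cast<unsigned char>(text[i])) != 0;
        if (isLetter)                       // letter: start or continue a word
        {
            current += text[i];
            state = stInWord;
        }
        else if (state == stInWord)         // separator right after a word: flush it
        {
            words.push_back(current);
            current.clear();
            state = stOutside;
        }                                   // separator outside a word: stay put
    }
    if (state == stInWord)
        words.push_back(current);           // the text may end mid-word
    return words;
}

Sentence and paragraph boundaries add more states and transitions; presumably that is exactly what the transition table from the additional materials encodes.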
2. Getting to know the code
To “warm up”, let us first look at the class of the main form, TAuthoringAnalyserTable ( [cpp] , [h] ). (If no “warm-up” is needed, you can proceed straight to the next section.) The form itself is terrible; its usability is, let's say, close to zero. But we are interested in the code, not the buttons.
At the beginning of the cpp-file we see the instantiation of classes:
TVCLControllersFasade VCLFasade;           // 1 - facade for working with VCL
TAnalyserControllersFasade AnalyserFasade; // 2 - facade for the analysis algorithms
TVCLViewsContainer ViewsContainer;         // 3 - container of visual components

Here, for (1) and (2), the Facade pattern is applied (Facade, [1] , [2] , [3] , [4] ). The facade class (1) hides a large interface for working with VCL visual components. That is where the reactions to clicking the "Load Text" button, to updating the list of texts, and generally to any event from the form are registered. The form calls these functions without knowing what will happen; the facade hides everything superfluous from it. In fact, though, VCLFasade ( [cpp] , [h] ) only connects events from the form with the algorithms; it does not contain those algorithms itself - they live further down, in another facade: (2), AnalyserFasade ( [cpp] , [h] ). Class (1) merely redirects calls to object (2) and does some additional work, like filling in the List visual component. Yes, it is a monstrous construction: object (1) knows about object (2) and its functions. How does it know? In the constructor of the main form, a little lower, the first facade is parameterized by the second:
// .......
VCLFasade.SetAnalyserControllersFasade(&AnalyserFasade); // Parameterizing one object with another.
// .......
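If this is hard to picture from the description alone, the whole construction boils down to roughly the following sketch. The real classes are far larger; OnLoadTextClick is a made-up stand-in for the actual event handlers, and the parameter types are invented:

// Sketch only: facade (1) holds a pointer to facade (2) and redirects calls to it.
class TAnalyserControllersFasade                  // facade (2): analysis algorithms
{
public:
    void LoadAsPrototype(const char* text) { /* ...real loading lives here... */ }
    void DoAnalysis()                      { /* ...real analysis lives here... */ }
};

class TVCLControllersFasade                       // facade (1): reactions to form events
{
public:
    TVCLControllersFasade() : _Analyser(0) {}

    void SetAnalyserControllersFasade(TAnalyserControllersFasade* f) { _Analyser = f; }

    void OnLoadTextClick(const char* text)        // called from a button handler on the form
    {
        _Analyser->LoadAsPrototype(text);         // redirect the call to facade (2)...
        // ...and do the extra VCL work here (fill lists, labels, and so on).
    }

private:
    TAnalyserControllersFasade* _Analyser;
};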

Now I am no longer sure that class (1) is really a Facade; perhaps it is something else, or just that. It would also have been nice to make the facades Singletons (Singleton, [1] , [2] , [3] , [4] ). Unfortunately, two years ago this did not occur to me. Not that the program has suffered - everything works as it should - but we lose some of the guarantees associated with the Singleton pattern. After all, should it be possible to create several entry points into one subsystem? No. It would have been worth forbidding that.
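For comparison, making such a facade a Singleton would have taken only a few lines. This is merely a sketch of what could have been done, not code from the project:

// Sketch: a Meyers-style Singleton facade - one and only one entry point.
class TSingletonFasadeSketch
{
public:
    static TSingletonFasadeSketch& Instance()
    {
        static TSingletonFasadeSketch instance;   // created on first use, exactly once
        return instance;
    }

private:
    TSingletonFasadeSketch() {}                                        // nobody can construct it directly,
    TSingletonFasadeSketch(const TSingletonFasadeSketch&);             // copy it,
    TSingletonFasadeSketch& operator=(const TSingletonFasadeSketch&);  // or assign it (declared, never defined).
};

With the constructor private, a second entry point into the subsystem simply cannot be created - exactly the ban mentioned above.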
What else is interesting in the main form constructor?
// .......
VCLFasade.SetViewsContainer(&ViewsContainer); // The views container is passed into facade (1),
VCLFasade.SetAuthoringAnalyserTable(this);    // along with a pointer to the main form.

/* Note: one can conclude that somewhere there are cyclic inclusions of h-files and,
   possibly, forward declarations. Of course this is bad, but that is how it turned out back then. */

// Parameterizing facade (2) with the reporting system:
AnalyserFasade.SetAuthoringAnalysisReporter(&AnalysisReporter);
AnalyserFasade.SetResoundingAnalysisReporter(&AnalysisReporter);
// .......
Then comes a long stretch of code that fills the container of visual components (controls). An element (for example, a button) is taken from the form and put into the container, into its own group:
// .......
TVCLViewsContainer * vc = &ViewsContainer;          // To shorten the name.

vc->AddViewsGroup(cCurrentTextInfo);                 // Create a group of components; cCurrentTextInfo is the group's text name.
vc->AddView(cCurrentTextInfo, LCurrentTextNumber);   // LCurrentTextNumber, LCurrentTextAuthor,
vc->AddView(cCurrentTextInfo, LCurrentTextAuthor);   // LCurrentTextTitle, etc. are pointers to visual components.
vc->AddView(cCurrentTextInfo, LCurrentTextTitle);    // For example: TLabel *LCurrentTextNumber;
vc->AddView(cCurrentTextInfo, CLBTextsListBox);
vc->AddView(cCurrentTextInfo, MSelectedTextPreview);
vc->AddViewsGroup(cKeyTextInfo);                     // Create one more group.
vc->AddView(cKeyTextInfo, LKeyTextNumber);
vc->AddView(cKeyTextInfo, LKeyTextAuthor);
vc->AddView(cKeyTextInfo, LKeyTextTitle);
vc->AddView(cKeyTextInfo, CLBTextsListBox);
// ...and so on...
Quite a curious approach, although not a very clear one. The controls are handed over to container (3), which in turn is handed over to facade (1), where they are evidently used somehow. After examining the classes TVCLViewsContainer ( [cpp] , [h] ) and TVCLView ( [h] ), it becomes clear that everything that is done with the controls is Update, Show / Hide, Enable / Disable - and by groups. You can update one whole group and hide another, knowing only its name... What this was needed for, I can now only guess. The approach violates encapsulation: anything can be done with the controls, right down to deleting them, since they are taken outside the bounds of their form and thus risk being changed from outside.
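As far as I can reconstruct them, the mechanics boil down to something like the following sketch. It is not the real TVCLViewsContainer, and TFakeControl here merely stands in for a VCL control:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

class TFakeControl                                   // stand-in for a VCL control
{
public:
    void Show() { /* ... */ }
    void Hide() { /* ... */ }
};

// Sketch: controls are registered under a group name, and whole groups can
// then be shown or hidden knowing only that name.
class TViewsContainerSketch
{
public:
    void AddViewsGroup(const std::string& group)             { _Groups[group]; }
    void AddView(const std::string& group, TFakeControl* c)  { _Groups[group].push_back(c); }

    void HideGroup(const std::string& group)
    {
        std::vector<TFakeControl*>& views = _Groups[group];
        for (std::size_t i = 0; i < views.size(); ++i)
            views[i]->Hide();                                 // act on the whole group at once
    }

private:
    std::map<std::string, std::vector<TFakeControl*> > _Groups;
};

The convenience and the danger are the same thing: whoever holds the container holds raw pointers to the form's controls.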
There is nothing else of interest in the main form class, so let's take a closer look at class (2) ( [cpp] , [h] ). This second facade is a real one, no joke - except that its name is spelled with an "s" instead of a "c" ("Facade" is the correct spelling, as it is given everywhere, including in GoF ( [1] , [2] , [3] )). The class simplifies the work with the analysis subsystems by hiding the real classes behind its interface. There are three of these real classes:
class TAnalyserControllersFasade
{
    TTextsController _TextsController;
    TAuthoringAnalyser _AuthoringAnalyser;
    TResoundingMapAnalyser _ResoundingAnalyser;
    // .......

The simple functions of the TAnalyserControllersFasade class call the more complex functions of the three real classes, but the client knows nothing of this complexity. That simplifies both development and use: we load the texts (the LoadAsPrototype() and LoadAsKeyText() functions), load the analyzer settings (LoadResoundingAnalysisRules()), start the analysis (the DoAnalysis() function), and somewhere out there it all magically works. If we look more closely at DoAnalysis(), we see that the desired analysis is selected by its text name. That is good. What is bad is that, paired with a facade, this is not a very extensible solution. If I wanted to add one more kind of analysis - say, grammar checking - I would need to add a fourth real class, GrammarAnalyser, and register several additional functions in the facade. And what if I were writing a super-universal text analysis tool with dozens of such analyzers? Then I would have to come up with unified interfaces, raise an abstraction over the analyzers, make the algorithms replaceable at run time... It would turn into something very, very big. Fortunately, my megalomania is a little milder than that, and none of it was required at the time.
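Just to make the remark about unified interfaces concrete, the extensible variant would look roughly like this. It is a sketch of what I did not do, and all the names are hypothetical:

#include <map>
#include <string>

// Sketch: analyzers share one interface and are looked up by name, so adding
// another analyzer would not require touching the facade's interface at all.
class TAbstractAnalyser
{
public:
    virtual ~TAbstractAnalyser() {}
    virtual void DoAnalysis() = 0;
};

class TAnalyserRegistry
{
public:
    void Register(const std::string& name, TAbstractAnalyser* analyser)
    {
        _Analysers[name] = analyser;
    }

    void DoAnalysis(const std::string& name)          // "analysis by text name"
    {                                                 // reduces to a single lookup
        std::map<std::string, TAbstractAnalyser*>::iterator it = _Analysers.find(name);
        if (it != _Analysers.end())
            it->second->DoAnalysis();
    }

private:
    std::map<std::string, TAbstractAnalyser*> _Analysers;
};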
3. TAuthoringAnalyser internals and text storage
Take a look at the TAuthoringAnalyser class ( [cpp] , [h] ) - the real class that does the real authorship analysis. At the very beginning of the h-file, monstrous typedefs immediately catch the eye:
class TAuthoringAnalyser : public TAnalyser
{
public:

    typedef map< /* ... */ > TTextsParSentWordTrees;
    typedef map< /* ... */ > TLeveledEqualifiedMaps;
    typedef map< /* ... */ > TLeveledFrequencyContainers;
    typedef map< /* ... */ > TIndexToAliasAssociator;
    typedef map< /* ... */ > TLeveledIndexToAliasAssociators;

    // <Unsafe code>  // Note: what exactly made this code unsafe, I am unlikely to remember...
    typedef map< /* ... */ > TLeveledResultVectors;
    //
    // .......
These types are needed to store all the intermediate data, calculations, and results. TTextsParSentWordTrees obviously holds the structural trees of the texts: "whole text -> paragraphs -> sentences -> words"; TLeveledFrequencyContainers holds the frequency characteristics of the texts distributed by level, and so on. You may also notice that all the built-in types are aliased ( [h] ): TUInt == unsigned int, TTextString == AnsiString. It is hard to imagine when this could come in handy. Aliased types can, of course, be swapped in no time without touching the project files, but how often do such situations arise? When it suddenly turns out that a 32-bit integer is not enough? When AnsiString suddenly stops satisfying us and we want std::string? The situation is too hypothetical, and it arises primarily in a poorly designed program. Be that as it may, the types are aliased; they do not really get in the way, they do not really help, and one simply has to get used to them.
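Judging by the name and the description, the tree type can be imagined roughly as follows. This is purely a guess for the sake of illustration; the real template parameters did not survive in the listing above:

#include <map>
#include <string>
#include <vector>

// Illustrative guess at the "whole text -> paragraphs -> sentences -> words" structure.
typedef std::vector<std::string>   TWords;            // words of one sentence
typedef std::vector<TWords>        TSentences;        // sentences of one paragraph
typedef std::vector<TSentences>    TParagraphs;       // paragraphs of one text
typedef std::map<int, TParagraphs> TTextTreesSketch;  // text index -> its structural tree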
Below, in the private section of the analyzer, objects of these and other types are declared:
// .......
private:
    TTextsConfigurator *_AllTextsConfigurator;
    TTextsConfigurator _AnalysedTextsConfigurator;
    TTextsParSentWordTrees _Trees;
    TLeveledEqualifiedMaps _LeveledEqualifiedMaps;
    TLeveledFrequencyContainers _FrequencyContainers;
    TLeveledIndexToAliasAssociators _IndexToAliasAssociators;
// .......
The TTextsConfigurator class has a complex structure. Its tasks include loading, storing, and providing access to texts - without deep copying. The program would not be much good if the texts passed as parameters were copied wholesale: neither memory nor processor time would suffice. So TTextsConfigurator provides access through pointers; the assumption is that once loaded, a text is always available. The "text configurator" also stores additional information: whether the text is a sample or the key text; whether it is activated or not (in the program you can exclude texts from the analyses); who the author is, what the title is, and so on. How this is implemented can be seen in the classes TTextsConfigurator ( [cpp] , [h] ), TTextConfiguration ( [cpp] , [h] ), TLogicalTextItem and TMPLogicalTextItem ( [h] ), TRawDataItem and TMPRawDataItem ( [cpp] , [h] ), and TTextDataProvider ( [cpp] , [h] ). Objects of these classes are nested into one another in exactly this order, giving a kind of matryoshka. The idea was to separate the logical and the physical representation of a text, and also to make it possible to load "raw data" from different sources without knowing anything about the source or the format in which the text is stored; the raw-data loader can therefore be swapped. Among other things, the Smart Pointer pattern is used here (the TMPRawDataItem and TMPLogicalTextItem classes) in its "Master Pointer" incarnation ( [1] , [2] ). There is also a class hierarchy that abstracts away the physical representation of texts. Hardly any of this turned out to be truly necessary; maybe I did extra work, but I gained a lot of experience and positive emotions.
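The Master Pointer idea itself is simple: one object owns the data, everyone else receives plain non-owning pointers, and nothing is ever deep-copied. Stripped of every detail it is roughly this sketch, not the project's TMPRawDataItem / TMPLogicalTextItem:

// Sketch of a master pointer: the sole owner frees the object; clients only observe.
template <class T>
class TMasterPointerSketch
{
public:
    explicit TMasterPointerSketch(T* object) : _Object(object) {}
    ~TMasterPointerSketch() { delete _Object; }        // the one and only owner frees it

    T* Get() const        { return _Object; }          // hand out non-owning access
    T* operator->() const { return _Object; }

private:
    TMasterPointerSketch(const TMasterPointerSketch&);            // ownership is never duplicated
    TMasterPointerSketch& operator=(const TMasterPointerSketch&); // (declared, not defined)

    T* _Object;
};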
