Hanazono December 3, 2014 at 15:19

Preprocessors and metalanguages in error correction programs

From the sandbox

Computational linguistics is very conservative: despite the huge number of existing programs that solve very complex preprocessing tasks for target languages, such techniques are rarely used in spell checkers. Below, using the commonly accepted “hard” case of correcting “-тся” versus “-ться”, I will try to show how the “conservatism” of programmers leads to a specific class of errors.

A. Reformatsky, a very perspicacious linguist, wrote that some errors are the result of inept teaching in grammar school. Alpatov, a caustic man, remarked (I quote): “One could say that Russian grammar was based on the St. Petersburg Germans' notions of the Russian language.” Given the defects of school teaching and the peculiar psychology of the compilers of grammars, the -тся/-ться cases are, and remain, outcasts in computer spell checkers.

These cases apparently also owe their gloomy reputation to the fact that grammarians strongly recommend “checking” the spelling with the questions “what to do?” and “what does it do?”, to the exclusion of other methods. Of course, if you follow this recommendation, no algorithmization and no subsequent correction are possible. All that remains is the “anagrammatic approach” (where the program outputs several variants of the “corrected” word). This is apparently where S. A. Krylov's attempt to divide programs into purist and laxist ones comes from.

The “fixation” on sentence parsing among programmers who work in a team with professional linguists evidently stems from linguists' lack of understanding of programming principles and from linguistic notions being “imposed” on programmers. The respected S. A. Krylov demonstrates exactly this; see his post on a well-known forum. That is a linguist's view, not the view of a programmer, for whom a different kind of question matters: can a grammar rule be algorithmized, or is algorithmization impossible, so that the word must be checked with the “dictionary” approach?

Correcting the spelling of reflexive verbs is surprisingly easy in 40% of cases (or more) if you abandon “what does it do / what to do” and classify the reflexive verb as it should be classified, by its meaning: proper reflexive, reciprocal reflexive, objectless reflexive, general reflexive, and so on. The task of correcting a word then reduces to (a) preprocessing the phrase or word and (b) creating a simple metalanguage for describing the “rules”; this metalanguage will look like a classic stream editor, that is, a well-known class of programs.

So, let the “fuel” for the “preprocessor” be an array of the last seven (or fewer) letters of words ending in -тся and -ться (taken, for example, from Zaliznyak's dictionary). Taking more letters will obviously increase the “accuracy.”
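As a rough illustration of how such endings could be cut from dictionary words, here is a minimal C++ sketch. It assumes the words are stored as UTF-8 `std::string`s; the helper name `last_letters` is hypothetical and not part of the author's code.

```cpp
#include <cstddef>
#include <string>

// Return the last max_letters letters (UTF-8 code points) of word,
// or the whole word if it is shorter than that.
std::string last_letters(const std::string& word, std::size_t max_letters = 7) {
    std::size_t pos = word.size();
    std::size_t letters = 0;
    while (pos > 0 && letters < max_letters) {
        --pos;
        // UTF-8 continuation bytes look like 10xxxxxx; they belong to the
        // letter that starts before them, so only lead bytes are counted.
        if ((static_cast<unsigned char>(word[pos]) & 0xC0) != 0x80) {
            ++letters;
        }
    }
    return word.substr(pos);
}
```

Counting code points rather than bytes matters here because each Cyrillic letter occupies two bytes in UTF-8.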
We arrange the resulting data in the array as a “fir tree with its roots in the sky”; this optimizes and speeds up the search and also guards against possible errors (see the code).
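One plausible reading of the “fir tree” layout, suggested by the sample lists in this post, is that longer endings sit above shorter ones, so that the most specific ending is tried first. A minimal sketch under that assumption follows; the `Rule` alias and both function names are made up for illustration, and strings are assumed to be UTF-8.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One rule: misspelled ending -> correctly spelled ending.
using Rule = std::pair<std::string, std::string>;

// Number of letters (UTF-8 code points); continuation bytes are not counted.
std::size_t letter_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Put longer (more specific) endings first. stable_sort keeps the original
// order among endings of equal length.
void sort_longest_first(std::vector<Rule>& rules) {
    std::stable_sort(rules.begin(), rules.end(),
                     [](const Rule& a, const Rule& b) {
                         return letter_count(a.first) > letter_count(b.first);
                     });
}
```

With this ordering, a linear scan that stops at the first match always prefers the longest ending that fits the word.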

If anyone dares to repeat my experiment with Zaliznyak's dictionary, the result is rather surprising: such an array contains only 3548 endings (that is, the last seven or fewer letters of a word) for which the spelling is unambiguously either -тся or -ться. The number of endings where both -тся and -ться are possible is just as small: only 407. Amazing, isn't it? It is now enough to “run” the word being checked through the arrays, and we get rid of incorrect spellings of words such as the Russian verbs for “seems” and “have to”, along with the notorious “anagrams.” (The second array, where both spellings are possible, requires a “metalanguage.”)
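The two-array check described above might be sketched as follows. This is only an illustration: the `Verdict` enum, `ends_with`, and `check_word` are hypothetical names, any entries you feed it here are invented rather than taken from the 3548-line database, and the ambiguous list is matched only in the form in which it is stored.

```cpp
#include <string>
#include <utility>
#include <vector>

// Misspelled ending -> correctly spelled ending.
using Rule = std::pair<std::string, std::string>;

enum class Verdict { Corrected, Ambiguous, Unchanged };

// True if word ends with ending (byte-wise comparison; safe for UTF-8).
bool ends_with(const std::string& word, const std::string& ending) {
    return word.size() >= ending.size() &&
           word.compare(word.size() - ending.size(), ending.size(), ending) == 0;
}

// First try the "only one spelling is possible" rules and fix the word in
// place; otherwise report whether the word needs the metalanguage stage.
Verdict check_word(std::string& word,
                   const std::vector<Rule>& unambiguous,
                   const std::vector<std::string>& ambiguous) {
    for (const Rule& rule : unambiguous) {
        if (ends_with(word, rule.first)) {
            word.replace(word.size() - rule.first.size(),
                         rule.first.size(), rule.second);
            return Verdict::Corrected;
        }
    }
    for (const std::string& ending : ambiguous) {
        if (ends_with(word, ending)) return Verdict::Ambiguous;
    }
    return Verdict::Unchanged;
}
```

With an invented rule `жеться -> жется`, the misspelled "кажеться" would be rewritten to "кажется", while a word ending in "ятся" would only be flagged for the later, rule-based stage.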

This is what the “only one option is possible” array looks like (just a few of the 3548 lines, of course):

// “Fir-tree in the sky”
In the first group, the incorrect spelling comes before the separator and the correct one after it.
scamper :: scamper
yachkat :: scamper
scamper :: scamper
scamper :: scamper
scamper :: scamper
scamper :: scoffgate
curse :: scamper
scout :: suck
scamper :: curse

In the second group, likewise, the incorrect spelling comes before the separator and the correct one after it:

scribble :: I am
looking for it :: it is trying to be
ridiculous :: it is rushing to
yawn :: it is yawning to
yawning :: it is yawning to be
yawning :: it is yawning to be
yawning :: rushing to
yumming :: being wishing to
yawning :: hoying
yutsya are ::

// Array for cases where the correct spelling cannot be determined (listed for -тся; for -ться it is obviously enough to insert the soft sign)

yashitsya
yachitsya
yachatsya
tub
learning
slyatsya
rschitsya
oitsya
nyatsya
nutsya
yatsya
ytsya
utsya
stsya
GSI
etsya
atsya

For example, here is simple code that searches the databases for matches:

#include <string>
#include <utility>
#include <vector>
using namespace std;

// file_operations, string_utilities and global::file_paths are the project's
// own helpers; their definitions are not shown in this post.
string correction_verbs(string str) {
    // map cannot be used here: it sorts its keys automatically,
    // which would destroy the order of the array
    vector<pair<string, string> > data;
    vector<pair<string, string> >::const_iterator it;
    // The classes and functions are so obvious that there is no point in describing them.
    file_operations file_io;  // just a class for reading files into an array
    string_utilities str_ut;  // a class that provides the replace_all utility
    string separator;         // separator between the two spellings (can be anything)
    // Get the path to the data file and load the array
    string verbs = global::file_paths.find("verbs_cfg")->second;
    data = file_io.readf_vector_pair(verbs, separator);
    for (it = data.begin(); it != data.end(); ++it) {
        // Take the grapheme before the separator and the one after it
        const string& first_str = it->first;    // incorrect spelling
        const string& second_str = it->second;  // correct spelling
        // Search the string for the incorrect spelling;
        // on a match, replace ться with тся (or vice versa)
        if (str.find(first_str) != string::npos) {
            str = str_ut.replace_all(str, first_str, second_str);
            // Stop at the first matching rule
            break;
        }
    }
    return str;
}
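The helper classes used above are not shown in the post. A minimal sketch of what a pair reader and `replace_all` could look like is given below. `read_pairs`, `trim`, and this free-function `replace_all` are assumptions for illustration, not the author's `file_operations` or `string_utilities`; the sketch reads from any `std::istream`, so it can be tried on a string as well as on a file.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying the reader without a file
#include <string>
#include <utility>
#include <vector>

// Strip leading and trailing spaces, tabs and carriage returns.
std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Read "wrong <separator> right" lines in file order (order matters for the
// search). Lines without the separator or with an empty side are skipped.
std::vector<std::pair<std::string, std::string>>
read_pairs(std::istream& in, const std::string& separator) {
    std::vector<std::pair<std::string, std::string>> data;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t pos = line.find(separator);
        if (pos == std::string::npos) continue;
        std::string left = trim(line.substr(0, pos));
        std::string right = trim(line.substr(pos + separator.size()));
        if (!left.empty() && !right.empty()) data.emplace_back(left, right);
    }
    return data;
}

// Replace every occurrence of from with to, scanning left to right.
std::string replace_all(std::string text, const std::string& from,
                        const std::string& to) {
    if (from.empty()) return text;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}
```

A `std::vector` of pairs is used for the same reason the author gives: it preserves the order in which the endings were written.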

Although this solution is obvious and widely used in programming, for example when parsing artificial languages, the Orfo linguistic program on the market (not the least successful one, perhaps even excellent in many respects) cannot fully solve the problem. It has no “preprocessor”; instead it relies on the same notorious anagrammatic “distance calculation” algorithms, which inevitably “force” Orfo to offer some insane correction candidates.

Take a look at online.orfo.ru.
Enter a phrase containing the error: "have to fix the error" (in the Russian original, the reflexive verb is misspelled).
As expected, the output is the following list of suggestions: to dress up, you have to, take a laugh, dress up, beat.
Draw your own conclusions.
(Perhaps a telling example of how linguists “bent” programmers to their will and ended up with an inferior, illiterate solution.)

Now consider how the preprocessor described above handles this. The str variable holds the misspelled word. The array contains only reflexive-verb endings “with an unambiguous spelling.” The program looks for a match on the last letters, starting from the top of the array, and returns the word with the wrong ending unambiguously replaced by the right one (see the database: there are no other options). If the program finds no match in the array, the function returns the word unchanged. The next step, of course, is to check the word against the -тся/-ться database using a certain set of descriptions. I will write about “metalanguages” in such linguistic programs in the next post, so that this one does not become too bulky.

P.S. Of course, preprocessing has its drawbacks, but the program “thinks like a human,” and the output is still a saner result.

Tags:

computer linguistics
