HabraChist or Self-defense against illiteracy

    What do you do with a person who
    answers every reproach and call for literacy with a single argument: "I don't give a ***"?

    - The cry of the soul

    If you don't like it, don't listen, but don't get in the way of the lying!
    - Proverb



    1. Introduction


    Not long ago, Habr was flooded with topics calling for universal literacy and the use of spell checkers. This seems to be a seasonal phenomenon, and the current lull is temporary. Some Habr users will always be sure that spelling is not what matters in a text; others will always be annoyed by typos and errors that get in the way of perceiving that very meaning. One way or another, the exhortations of the latter change little, and the most popular answers to calls for literacy can be found in the epigraphs.

    In this article I propose a partial solution to the literacy problem on Habr that gives each side what it wants: some will be able to write however they like, while others will be able to read Habr without "stumbling" over mistakes. Unlike other methods ("read the textbooks"), the described approach is directly related to IT and takes less than a minute to master (if you skip straight to the "Conclusion").

    2. Motivation


    The reason for writing this article was my "intuitive" literacy, thanks to which I, like many others, am annoyed by texts with lots of errors. Moreover, I have recently begun to notice that the sense of language I acquired in childhood at the cost of several diopters is starting to fade from lack of use, while my glasses are not getting any thinner.

    The second, more technical and prosaic reason was a long-standing desire to get to grips with regular expressions. Learning something without an interesting task is simply boring, which is how the idea of writing HabraChist struck me :)

    3. Approach


    3.1. Technical side


    HabraChist is a JavaScript script with a set of rules for detecting and fixing the most common errors. It requires Greasemonkey, an extension for Firefox. Installing and using Greasemonkey has already been described on Habr. If desired, the script can be adapted for Opera.

    After the page loads, HabraChist applies each rule in turn to the titles and texts of topics, as well as to comments. The rules are hardcoded in the script as regular expressions and form a set of pairs of the form

    "(ж|ш)ы" -> "$1и"

    The left side is the regular expression used to find an error. The right side describes what the found error should be replaced with. The example above corresponds to the school rule that "жи" and "ши" (zhi, shi) are written with "и" (i). "$1" is replaced with the contents of the parentheses in the match, that is, the letter "ж" or "ш". It does not matter here whether the letter is lowercase or capital: the one used in the source text is inserted. Where the first letter of the original word and of its replacement differ (for example, "щаз" → "сейчас", "now"), the rule has to be duplicated for words that start with a capital letter ("Щаз" → "Сейчас").
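
    A minimal sketch of how such rule pairs might be applied, assuming the example rule targets the Russian "жы"/"шы" misspelling and that the script walks the text nodes of the page. The names `rules`, `applyRules` and `fixTextNodes` are hypothetical and are not taken from HabraChist itself.

```javascript
// Hypothetical rule list: [pattern, replacement] pairs, applied in order.
const rules = [
  // "жы"/"шы" -> "жи"/"ши"; the "i" flag matches either case, and $1
  // re-inserts the consonant exactly as it appeared in the source text.
  [/(ж|ш)ы/gi, "$1и"],
  // The first letters differ here, so the rule is duplicated for the
  // capitalized form instead of relying on the "i" flag.
  [/щаз/g, "сейчас"],
  [/Щаз/g, "Сейчас"],
];

// Apply every rule in turn to a single string.
function applyRules(text, ruleList) {
  return ruleList.reduce(
    (current, [pattern, replacement]) => current.replace(pattern, replacement),
    text
  );
}

// Recursively fix all text nodes (nodeType 3) below a DOM-like root.
// Any object with nodeType, nodeValue and childNodes will do.
function fixTextNodes(node, ruleList) {
  if (node.nodeType === 3) {
    node.nodeValue = applyRules(node.nodeValue, ruleList);
    return;
  }
  for (const child of node.childNodes || []) {
    fixTextNodes(child, ruleList);
  }
}
```

    In a real user script, `fixTextNodes` would presumably be called once on the containers that hold topic texts and comments; which elements those are depends on the page markup and is not specified here.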

    3.2. Data collection


    The main difficulty was finding the most common errors. At first I read comments from the bottom up. It turned out, however, that most Habr people who hold an alternative position are enviably literate. So I had to change tactics: taking some arbitrary misspelling along the lines of "seem", I went to Google to look for comments by Habr users who prefer that particular spelling. Their texts, of course, often contained other "words of unconventional spelling" as well. I learned about some interesting errors (not without surprise) from the FAQ "Spelling in Russian".
    The rule base kept growing throughout the testing period.

    3.3. Limitations


    Of course, the proposed approach can only catch simple errors that can be detected by searching within a string. More sophisticated checks, such as asking "what to do?" or "what will they do?" to decide between the verb endings "-ться" and "-тся", are beyond the scope of regular expressions (however, specific rules for particular combinations are always applied). Many errors cannot be identified unambiguously, for example the joined versus separate spelling of "not" (compare "ugly interface" with "the interface is not beautiful, but terrible"). Still, this class includes errors that, although ambiguous, occur far more often as misspellings than as correct spellings. For example, the word "кстати" ("by the way") is very often written with a space (“By the way, I wanted to tell you ...”), although the space can be justified (“By the way, the prince does not find fault”). In such cases I assumed that a hundred fixed errors are worth a couple of introduced ones.

    The prevalence of a particular error was estimated from Google results and, in some cases, Yandex. The method is not very accurate, because a single mistake in an article's title is repeated by Google for every comment. Unfortunately, the popularity of some errors cannot be estimated through search engines at all, although experience suggests they are frequent. These include, in particular, the missing soft sign in second-person present-tense verbs ("kachaesh", "slushaesh"), since Google does not allow searching by a mask such as "*esh".

    Another limitation stems from the way the data was collected: many errors were simply never found. With your help this can be fixed quickly: post in the comments the errors that bother you on Habr, ideally straight away as regular expressions.

    4. Result


    You can download HabraChist from userscripts.org. The basic set of 157 rules corrects more than 70 thousand errors of different categories: grammatical, slang, "Albanian" (deliberately distorted Internet spelling), profanity, etc.

    Below are the Top 10 grammar errors on Habr.
    Error                                   | Correct                                 | Count 1)
    All the same, all the same              | all the same                            | ~ 5000 2)
    Right now                               | now                                     | ~ 4000
    And                                     | and                                     | ~ 3100
    I don’t know, I don’t want, I can’t ... | I don’t know, I don’t want, I can’t ... | ~ 3000
    Something, somehow, somehow, like ...   | something, somehow ...                  | ~ 2900 2)
    Flash                                   | flash 3)                                | ~ 2800
    Hardly, vryatli, lie                    | unlikely                                | ~ 2400
    Not right, not right ... 4)             | wrong, wrong ...                        | ~ 2300
    * ka (well ka, give ka ...)             | * -kay (well, give me ...)              | ~ 1800 2)
    Non-fiction, fiction, fiction           | was not, was not, was not               | ~ 1700
    1) The numbers are approximate, since they changed by hundreds over two weeks.
    2) Because Google returns mostly the correct spelling for a query such as “all the same”, the number of errors was estimated as follows: if one of the ten results on the first page is a misspelling, we take the share of errors to be 10% and multiply the total number of results by 0.1 to get the estimate.
    3) Anticipating indignation, I refer to the source: the Gramota.ru reference service.
    4) A space is sometimes needed in these cases, but on Habr it is most often used out of place (see "Limitations").
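
    The estimate described in footnote 2 is simple arithmetic and can be sketched as follows. The function name `estimateErrorCount` and the figure of 50,000 results are illustrative assumptions, not data from the article.

```javascript
// Estimate how many search hits are misspellings by sampling the first page.
// All numbers passed in are illustrative, not measurements from the article.
function estimateErrorCount(totalResults, misspelledOnFirstPage, pageSize = 10) {
  if (pageSize <= 0) {
    throw new RangeError("pageSize must be positive");
  }
  // Share of misspellings among the sampled results, e.g. 1 of 10 -> 0.1.
  const errorShare = misspelledOnFirstPage / pageSize;
  return Math.round(totalResults * errorShare);
}

// Hypothetical query: 50,000 hits, one misspelling among the first ten results.
const estimate = estimateErrorCount(50000, 1); // 5000
```

    A sample of ten results is very small, so an estimate obtained this way should be read as an order of magnitude rather than a count.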

    4.1. Testing


    Testing the script and expanding the ruleset took two weeks and included reading both the main page and the most evil corners of Habr. During testing, the script left the original version in brackets after each fix, so that both the average number of errors and the correctness of the script could be judged visually. All problems noticed so far have been fixed; the rest will be dealt with as feedback comes in.
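
    One possible way to get such a test mode, shown only as a sketch: JavaScript replacement strings support "$&", which stands for the whole matched text, so the original can be appended in brackets after the fix. The helper name `applyRuleVerbose` and the exact output format are assumptions; the article does not show how the script did it.

```javascript
// Apply one rule and keep the original match in brackets after the fix.
function applyRuleVerbose(text, pattern, replacement) {
  // "$&" in a replacement string expands to the whole matched substring.
  return text.replace(pattern, `${replacement} ($&)`);
}

// A whole-word rule: the original word follows the corrected one.
const wholeWord = applyRuleVerbose("щаз приду", /щаз/g, "сейчас");
// -> "сейчас (щаз) приду"

// A rule that matches a fragment: the bracket lands inside the word.
const fragment = applyRuleVerbose("жызнь", /(ж|ш)ы/gi, "$1и");
// -> "жи (жы)знь"
```

    The second call shows a drawback of this sketch: rules that match only part of a word put the bracket in the middle of it, which is acceptable for eyeballing results but not for normal reading.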

    You are probably wondering how many dozens of regular expressions perform when processing hundreds of comments. The results are shown in the table.
    Place of use            | Time 1)
    Home - Tabernacles      | 1 sec
    Topic with 142 comments | 2 sec
    Topic with 385 comments | 5 sec
    1) Testing was carried out on a four-year-old laptop (Pentium-M 1.6 GHz, 1 GB of RAM); time was measured by hand and rounded up to the nearest second.

    Considering that most topics have fewer than a hundred comments, and that the average Habr user reads about one or two lines per second, the script's performance can be considered acceptable.
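
    The timings above were taken by hand. As an alternative (not what the author did), the rule pass could be timed in code. The sketch below assumes rules are stored as [pattern, replacement] pairs; `timeRules` and the sample data are hypothetical.

```javascript
// Run all rules over a list of texts and report the elapsed wall-clock time.
function timeRules(texts, ruleList) {
  const start = Date.now();
  const fixed = texts.map((text) =>
    ruleList.reduce(
      (current, [pattern, replacement]) => current.replace(pattern, replacement),
      text
    )
  );
  return { fixed, elapsedMs: Date.now() - start };
}

// Hypothetical input: 385 identical "comments" and a single rule.
const sampleComments = Array.from({ length: 385 }, () => "жызнь");
const timing = timeRules(sampleComments, [[/(ж|ш)ы/gi, "$1и"]]);
```

    This measures only the string replacements; on a real page, rewriting the DOM would add to the total.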

    4.2. Demonstration


    To demonstrate the script's capabilities, below is a short text packed with the most common errors on Habr. Compare the result with HabraChist enabled and without it.

    Attention! The text was written solely for testing HabraChist and may not coincide with the author's opinion.
    Caution! The text below can negatively affect a healthy psyche (I almost went crazy while composing it :)

    Huyase! What is being done, people! FSE write around incorrectly.
    In my opinion, any topic, there will be several issues. For example, what kind of mudag will throw out a huge number of bukafs without checking - it’s not clear whoever is, my Mosk reinforces too many wrong bukofs. I’m not talking about some hellish dalpaeps who write nonsense like “I have legs in my mouth !!! 111dinadin” in kamenty - you have to drive such a ghost. It seems to me, if you already write - Duc we can spend a minute looking for verification, all the same it will be better!
    Hez, I can’t understand.
    And I will not. Let them write as they wish, and read will ya like ya hochyu :)

    PS to the article, here ya schyaz poprobyval write with ashipki, UTB is full pistets!
    PPS Nothing personal :) Sorry if Cho netak.


    5. Conclusion


    This article presented HabraChist, a Greasemonkey script that fixes grammar errors on Habr "on the fly". The script uses a set of rules based on regular expressions. According to rough estimates, HabraChist eliminates more than 70 thousand spelling errors. Further development of the project depends entirely on your comments.

    Acknowledgments



    License. The script is completely free for non-commercial use, modification and improvement. When using rules from the script, please credit their author (that is, me, YasonBy).

    Excuse. As mentioned at the very beginning of the article, I am not an expert on regular expressions (nor on JavaScript, for that matter). Constructive criticism and tips for improving HabraChist will be greatly appreciated.

    UPD: Many Habr residents criticize the script for restricting their freedom of expression. Yet most of them probably use banner and ad blockers without worrying at all about the advertiser's freedom to show them its goods ...

    UPD: HabraChist also works with Safari + GreaseKit (as pointed out by XuMiX) and with Opera (Tools > Settings > Advanced > Content > JavaScript options > JavaScript user files: specify the folder containing the js file; thanks to tequibo for the how-to).
