A probabilistic morphological analyzer for Russian and Ukrainian in PHP

    Sooner or later, every site developer faces the task of implementing site search. Ideally the search should match on word stems, i.e. ignore word endings. This is what stemmers are for: programs that extract the stem of a word. Many stemmers are dictionary-based; to avoid huge dictionaries in small and medium-sized projects, you can use a probabilistic morphological analyzer instead. Its distinguishing feature is a relatively small data file and, accordingly, no load on the database, without a significant loss in the quality of stem extraction.

    Stemming is the process of finding the stem of a given source word. The stem does not necessarily coincide with the morphological root of the word. Stemming is a long-standing problem in computer science. It is used in search engines to generalize a user's search query.
    A specific implementation of stemming is called a stemming algorithm, or simply a stemmer.
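
    To make the idea concrete, here is a deliberately naive sketch of suffix stripping. It is not the probabilistic algorithm discussed in this article: the function name toyStem and the list of endings are made up for illustration, and the input is assumed to be lowercase UTF-8.

```php
<?php
// Toy suffix-stripping stemmer: NOT the probabilistic algorithm from the article.
// $endings is a hypothetical, hand-picked list of endings.
function toyStem(string $word, array $endings): string
{
    // Sort endings by byte length, longest first, so "ами" is tried before "и".
    usort($endings, function (string $a, string $b): int {
        return strlen($b) - strlen($a);
    });
    foreach ($endings as $ending) {
        $len = strlen($ending);
        // Byte-wise comparison is safe here: a complete UTF-8 suffix
        // can only match on a character boundary.
        if (strlen($word) > $len && substr($word, -$len) === $ending) {
            return substr($word, 0, -$len);
        }
    }
    return $word; // no known ending: the word is returned unchanged
}

$endings = ['а', 'у', 'ой', 'ами', 'и'];
foreach (['книга', 'книгу', 'книгой', 'книгами'] as $form) {
    echo $form, ' -> ', toyStem($form, $endings), "\n";
}
```

    All four forms reduce to the same stem, which is what lets a search for one form find the others. A real stemmer has to cope with far more endings and ambiguities than a fixed list can handle.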



    Recently I needed a stemmer of decent quality for Russian and Ukrainian. Digging through the Internet, I found a very interesting one on Andrey Kovalenko's website. Description of the stemmer.

    It was implemented in C++, which upset me a lot. Not because it was written in C++, but because my circumstances (PHP only) meant I could not use it. I did not accept that and, armed with a debugger, ported the application to PHP.

    The site also offers a faster stemmer in the form of a PHP module, but for me it does not really matter whether it processes 12 thousand words per second or 2-3 thousand; even one thousand would be enough (I did not measure the speed).

    Ported class code (stemka.php)

    How to make it work:

    Download the original library. From its library folder, take the dictionaries fuzzy*.inc.

    Convert the dictionaries into a form convenient for PHP. I converted the data to a binary file and load it with the file_get_contents function.

    Before converting, you need to edit the C++ files with the dictionaries:
    1. Add the tag "<?php" at the beginning of the file.
    2. Add "?>" at the end of the file.
    3. Replace "{" with "$fuzzy = array(".
    4. Replace "}" with ");".
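
    The four manual edits above can also be scripted. The sketch below is an assumption-laden illustration: the helper name cppArrayToPhp is made up, the sample string is a tiny stand-in and not real dictionary data, and it assumes the file contains a single brace-delimited initializer whose numeric literals are also valid PHP.

```php
<?php
// Apply the four manual steps to the text of a C++ dictionary file.
// Assumption: the only braces in the file are the ones around the initializer.
function cppArrayToPhp(string $cppSource): string
{
    // Steps 3 and 4: swap the C++ braces for a PHP array assignment.
    $body = str_replace(['{', '}'], ['$fuzzy = array(', ');'], $cppSource);
    // Steps 1 and 2: wrap the result in PHP opening and closing tags.
    return "<?php\n" . $body . "\n?>";
}

// Hypothetical stand-in for the contents of a fuzzy*.inc file.
$sample = "{ 0x01, 0x7f, 0xff, 12 }";

// Write the converted text to a temporary file and include it to check
// that PHP accepts it and that $fuzzy is defined.
$tmp = tempnam(sys_get_temp_dir(), 'fuzzy');
file_put_contents($tmp, cppArrayToPhp($sample));
include $tmp;
unlink($tmp);

echo count($fuzzy), " values\n";
```

    If the real files contain anything besides the initializer (for example a variable declaration before the opening brace), that part would still need to be removed by hand.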

    After that, run the conversion script and the files will be converted.
        <?php
        // Ukrainian dictionary: fuzzyuk.inc defines $fuzzy, an array of byte values.
        include "fuzzyuk.inc";
        $fp = fopen('fuzzyuk.dat', 'wb');
        foreach ($fuzzy as $v) {
            fwrite($fp, chr($v)); // write each value as a single byte
        }
        fclose($fp);

        // Russian dictionary: the same procedure.
        include "fuzzyru.inc";
        $fp = fopen('fuzzyru.dat', 'wb');
        foreach ($fuzzy as $v) {
            fwrite($fp, chr($v));
        }
        fclose($fp);
        ?>
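
    To see what the resulting .dat file looks like from the PHP side, here is a small round-trip sketch. The sample values are invented and the file is temporary. How stemka.php itself reads the data is not shown here; this only illustrates the chr / file_get_contents / ord mechanics.

```php
<?php
// Round trip: write byte values the way the conversion script does, then read them back.
$values = [0, 1, 127, 200, 255]; // hypothetical sample data, not real dictionary bytes
$path = tempnam(sys_get_temp_dir(), 'dat');

$fp = fopen($path, 'wb');
foreach ($values as $v) {
    fwrite($fp, chr($v));
}
fclose($fp);

// The whole file becomes one binary string.
$data = file_get_contents($path);
unlink($path);

echo strlen($data), " bytes\n";

// ord() turns a single byte of the string back into a number.
echo ord($data[3]), "\n";

// unpack('C*') converts all bytes at once; array_values() makes the keys start at 0.
$bytes = array_values(unpack('C*', $data));
var_dump($bytes === $values);
```

    Keeping the dictionary as a binary string means one byte per value, which is considerably more compact in memory than a PHP array of integers.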


    If you do not want to convert them yourself, already converted dictionaries are available: fuzzyuk.dat (243 KB), fuzzyru.dat (403 KB).

    The stemmer is ready to use. Usage example:

        <?php
        include "stemka.php";
        $stemka = new stemka();
        $str = 'rewrite';
        // The second argument selects the language: 'uk' for Ukrainian.
        echo $stemka->GetStemCrop($str, 'uk');
        ?>
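
    For site search, a stemmer is typically applied to every word of the query before matching. The sketch below shows one possible way to wire that up. The function queryToStems is hypothetical, and the demo uses a trivial stand-in stemmer (it strips trailing "s") so that it runs without stemka.php. In a real project the callable could wrap the GetStemCrop call from the usage example, assuming it returns the stem as a string.

```php
<?php
// Turn a free-text query into a list of unique stems.
// $stem is any callable that maps a word to its stem.
function queryToStems(string $query, callable $stem): array
{
    // Split on anything that is not a letter or a digit (the u flag enables UTF-8 mode).
    $words = preg_split('/[^\p{L}\p{N}]+/u', $query, -1, PREG_SPLIT_NO_EMPTY);
    $stems = array_map($stem, $words);
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($stems));
}

// Stand-in stemmer for the demo only: strips trailing "s" characters.
$toyStem = function (string $word): string {
    return rtrim($word, 's');
};

print_r(queryToStems('cats, cat dogs!', $toyStem));
```

    The same function would be applied to the indexed text, so that the query and the documents are compared stem to stem.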
     


    Or see the demo.

    I do not claim to cover the whole topic; I just decided to share the code in case it comes in handy for someone.

    Feel free to criticize and downvote.
