Parsing the Russian Dictionary of Andrey Anatolyevich Zaliznyak

    I once needed to collect a large number of Russian nouns in the nominative singular, so I started searching the Internet. Everything I came across was either in a format that was inconvenient for me or was an amateur collection. I wanted a more authoritative source that I could convert into my own format, for example a MySQL database table.

    On September 1, 2009, an order of the Ministry of Education and Science came into force, approving the list of dictionaries, grammars and reference books recommended by the Interdepartmental Commission on the Russian Language under the Ministry of Education and Science. Among the 4 approved books is the Grammatical Dictionary of the Russian Language by A. A. Zaliznyak.

    I settled on this dictionary, firstly, because it contains a morphological description of each word, which makes it possible to pull out, for example, only perfective verbs. Secondly, because I was able to find an electronic version of it.

    Another option was to scrape wiktionary.org (Category: Russian nouns). It may make sense to combine the two sources, but for now let's stick with Zaliznyak.

    Dictionary


    I found Zaliznyak's dictionary on the site of the Tower of Babel project, which is dedicated to comparative historical linguistics. The dictionaries of Ozhegov, Zaliznyak and Vasmer are available there both online and for download.

    Download the dicts.exe file dated 11.27.2004 and install it. The dictionary files will end up in the folder c:\StarSoft\dict\. We need only those whose names start with Z_ (from Z_160 to Z_239). Words are grouped into files by their first letters. That is, the file Z_160 holds all words starting with the letter A, Z_161 those starting with the letter B, and so on.
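    The file selection described above can be sketched in PHP roughly as follows. This is only an illustration: the helper names isZaliznyakFile and listZaliznyakFiles are hypothetical, and the sketch assumes the files are named "Z_" followed by the number, possibly with an extension. Adjust the pattern if the installed names look different.

```php
<?php
// Hypothetical helper: true for file names Z_160 .. Z_239.
// Assumption: the name is "Z_" plus a number, with an optional extension.
function isZaliznyakFile(string $name): bool
{
    if (!preg_match('/^Z_(\d+)(?:\.\w+)?$/i', $name, $m)) {
        return false;
    }
    $n = (int) $m[1];
    return $n >= 160 && $n <= 239;
}

// Hypothetical helper: collect the matching files from a folder, sorted by path.
function listZaliznyakFiles(string $dir): array
{
    $files = [];
    foreach (new DirectoryIterator($dir) as $file) {
        if ($file->isFile() && isZaliznyakFile($file->getFilename())) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}
```

    Calling listZaliznyakFiles('c:/StarSoft/dict') would then return only the Zaliznyak files and leave out the files of the other dictionaries installed in the same folder.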

    Parser


    The files are encoded in OEM 866. For convenience, I converted them to UTF-8 using Notepad++. Then I wrote a simple parser in PHP. I only needed masculine and feminine nouns, but you can change the regular expression to suit your needs. As a result, I got a table with 39361 nouns.
     
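    mb_internal_encoding('utf-8');

    // Scan the converted (UTF-8) dictionary files in ./dict/
    $dir = new DirectoryIterator(dirname(__FILE__).'/dict/');
    foreach ($dir as $file)
    {
        // Skip "." and ".."
        if ($file->isDot()) {
            continue;
        }

        // Matched line shape: word, a number, a gender code, then the rest.
        // In Zaliznyak's notation с, м, ж are neuter, masculine, feminine;
        // мо, жо are animate masculine, feminine. Note that с lets neuter
        // nouns through as well; drop it to keep only masculine and feminine.
        if (!preg_match_all('/^(\\p{L}{2,})\\s+\\d+\\s+(?:с|м|ж|мо|жо)\\s+/um', file_get_contents($file->getPathname()), $matches)) {
            continue;
        }

        foreach ($matches[1] as $word)
        {
            // do whatever you want with $word
        }
    }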
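    If you would rather not convert the files by hand, the conversion from OEM 866 and the "change the regular expression" step can be sketched in PHP as well. This is a minimal sketch, not the code I actually ran: cp866ToUtf8 and extractWords are hypothetical helpers, the sample lines are made up and shaped only to fit the pattern used in the parser above, and the conversion assumes the mbstring extension is available.

```php
<?php
// Convert raw OEM 866 (CP866) bytes to UTF-8 using mbstring.
function cp866ToUtf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'CP866');
}

// Hypothetical helper: same line shape as the parser above
// (word, number, gender code), but with the gender codes passed in.
function extractWords(string $utf8Text, array $codes): array
{
    $quoted = array_map(function ($code) {
        return preg_quote($code, '/');
    }, $codes);
    $pattern = '/^(\p{L}{2,})\s+\d+\s+(?:' . implode('|', $quoted) . ')\s+/um';
    preg_match_all($pattern, $utf8Text, $matches);
    return $matches[1];
}

// Made-up sample lines, shaped only to fit the pattern.
$sample = "абажур 3 м 1a\nбабушка 1 жо 3a\nокно 2 с 1d\n";

// Keep masculine and feminine nouns only (no "с").
$words = extractWords($sample, ['м', 'мо', 'ж', 'жо']);
print_r($words);
```

    With this sample the neuter line is skipped, because "с" is not in the list of codes. For a real file you would call extractWords(cp866ToUtf8(file_get_contents($path)), $codes) on each path.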


