Multilingual spellchecker for programs using Hunspell

Many people need to check spelling in several languages at once, but not every program supports this. Most make the user switch from one language to another, which is inconvenient and time-consuming.

I did not want to put up with this in programs that use Hunspell dictionaries (Firefox, SeaMonkey, Miranda, etc.), so I decided to write a GUI utility that automatically merges dictionaries for several languages into one that those programs can then use.


History of creation

A few words on how this came about. The idea originated in 2008, when I compiled a combined Russian-English dictionary for Firefox.

It was posted on the Mozilla FTP site:
ftp.mozilla-russia.org/dictionaries/ru-en_spell_dictionary.xpi
There was also a topic on the forum:
forum.mozilla-russia.org/viewtopic.php?id=15316

A lot of time has passed since then, but recently I received several letters, almost simultaneously, from people asking for something more up to date.
Rather than send out ready-made updates, I decided to finish the GUI utility so that users could merge several dictionaries themselves.

Back then I had written the utility “for myself” in Delphi, which I used at work, but that is hardly a cross-platform solution.
Of course, recent versions of Embarcadero RAD Studio can now build cross-platform applications, but I decided to implement the utility in Java.

Task

The minimal implementation of the utility should be able to:

1. Load dictionaries in the most common formats:
- uncompressed *.dic and *.aff files
- ZIP archive (*.zip)
- XPInstall format (*.xpi)
- OpenOffice extension (*.oxt)

2. Let the user select the dictionaries to merge

3. Let the user change the name and description of the resulting dictionary

4. Export the result in the following formats:
- uncompressed *.dic and *.aff files
- ZIP archive (*.zip)
- XPInstall format (*.xpi)
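
XPI and OXT packages are ordinary ZIP archives, so loading from all three archive formats can share one code path. Below is a minimal Java sketch of that idea, not the utility's actual code: the class ArchiveScanner and its methods are hypothetical names, and the sample archive contents are made up purely for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not taken from the hunspell-merge sources.
class ArchiveScanner {

    /** Returns the *.dic and *.aff entries of a ZIP-based archive (zip, xpi, oxt), keyed by entry name. */
    static Map<String, byte[]> readDictionaryFiles(byte[] archive) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (entry.isDirectory() || !(name.endsWith(".dic") || name.endsWith(".aff"))) {
                    continue; // skip manifests, icons, readme files and so on
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                files.put(entry.getName(), out.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    /** Builds a tiny in-memory archive so the sketch can run without real dictionary files. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            String[][] entries = {
                {"install.rdf", "<RDF/>"},
                {"dictionaries/xx.dic", "1\nword/A\n"},
                {"dictionaries/xx.aff", "SET UTF-8\n"}
            };
            for (String[] item : entries) {
                zip.putNextEntry(new ZipEntry(item[0]));
                zip.write(item[1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(readDictionaryFiles(sampleArchive()).keySet());
    }
}
```

Uncompressed *.dic and *.aff files need no such step and can be read directly from the source folder.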

Implementation

Before writing the program, I had to study the dictionary format in order to load and merge the files correctly.

Information about Hunspell can be found here:
hunspell.sourceforge.net

Format description:
pwet.fr/man/linux/fichiers_speciaux/hunspell
or in Russian translation:
mozilla-russia.org/projects/dictionary/hunspell.html

In short, Hunspell needs two files to check spelling. The first is a dictionary file containing the words (*.dic); the second is an affix file (*.aff), which defines the meaning of the special markers (flags) used in the dictionary. Flags are assigned to words in the dictionary file and defined in the affix file.

Given this format, the main challenge was not just to concatenate the dictionary files but to do so without breaking the affixes of the individual dictionaries.

There are three ways to name flags in the affix file:
1. Default: each affix is named with a single character, a case-sensitive letter or a digit.
2. Long: each affix is named with two characters, two letters or a letter and a digit.
3. Number: each affix is named with a number from 1 to 65000.
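
To make the three schemes concrete, here is a small Java sketch that splits the flag field of a dictionary entry (the part after the slash in a line such as word/AB) according to the scheme in use. It is an illustration only: FlagParser and its methods are hypothetical names, and the sketch ignores everything else a *.dic line may contain, such as morphological fields or escaped slashes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the hunspell-merge sources.
class FlagParser {

    /** The three flag naming schemes listed above. */
    enum Mode { SINGLE, LONG, NUM }

    /** Splits a flag field into individual flag names. */
    static List<String> parse(String field, Mode mode) {
        List<String> flags = new ArrayList<>();
        switch (mode) {
            case SINGLE: // one character per flag
                for (int i = 0; i < field.length(); i++) {
                    flags.add(field.substring(i, i + 1));
                }
                break;
            case LONG: // two characters per flag
                for (int i = 0; i + 1 < field.length(); i += 2) {
                    flags.add(field.substring(i, i + 2));
                }
                break;
            case NUM: // decimal numbers separated by commas
                for (String part : field.split(",")) {
                    String number = part.trim();
                    if (!number.isEmpty()) {
                        flags.add(number);
                    }
                }
                break;
        }
        return flags;
    }

    /** Returns the flags of a .dic line such as "word/AB"; a line without '/' has no flags. */
    static List<String> flagsOfEntry(String line, Mode mode) {
        int slash = line.indexOf('/');
        return slash < 0 ? new ArrayList<>() : parse(line.substring(slash + 1).trim(), mode);
    }

    public static void main(String[] args) {
        System.out.println(flagsOfEntry("word/AB", Mode.SINGLE));
        System.out.println(flagsOfEntry("word/A1B2", Mode.LONG));
        System.out.println(flagsOfEntry("word/101,205", Mode.NUM));
    }
}
```

The same field, for instance A1B2, means four flags in the default scheme and two in the long scheme, which is why a merging tool has to know the scheme of every source dictionary before touching its flags.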

In most of the dictionaries I came across, the affix file contained only a few dozen distinct flags, so their authors could get by with the first, single-letter approach. That approach clearly does not work when merging several files, because the combined number of distinct affixes is too large, so I decided to use numeric naming in the resulting files. The downside is a slight increase in file size, but I do not think that is critical.
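
One way to picture the move to numeric naming: every (dictionary, original flag) pair gets its own number, so a flag A from the Russian dictionary and a flag A from the English one no longer collide. The Java sketch below shows this under my own assumptions; FlagRemapper, the sequential numbering and the dictionary labels "ru" and "en" are illustrative and are not taken from the utility's sources.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the hunspell-merge sources.
class FlagRemapper {
    private static final int MAX_FLAG = 65000; // upper bound of the numeric scheme

    private final Map<String, Map<String, Integer>> byDictionary = new HashMap<>();
    private int next = 1;

    /** Returns the number assigned to a flag of a given dictionary, allocating one on first use. */
    int numberFor(String dictionary, String flag) {
        Map<String, Integer> flags = byDictionary.computeIfAbsent(dictionary, k -> new HashMap<>());
        Integer number = flags.get(flag);
        if (number == null) {
            if (next > MAX_FLAG) {
                throw new IllegalStateException("more than " + MAX_FLAG + " distinct flags");
            }
            number = next++;
            flags.put(flag, number);
        }
        return number;
    }

    /** Rewrites a list of original flags as a comma-separated numeric flag field. */
    String numericField(String dictionary, List<String> flags) {
        StringBuilder field = new StringBuilder();
        for (String flag : flags) {
            if (field.length() > 0) {
                field.append(',');
            }
            field.append(numberFor(dictionary, flag));
        }
        return field.toString();
    }

    public static void main(String[] args) {
        FlagRemapper remapper = new FlagRemapper();
        System.out.println(remapper.numericField("ru", List.of("A", "B")));
        System.out.println(remapper.numericField("en", List.of("A")));
    }
}
```

The same mapping has to be applied both to the words in the dictionary file and to the rule headers in the affix file, otherwise the two files would stop agreeing with each other.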

The dictionary files also often came in different encodings, so I chose UTF-8 as the common encoding for the output.
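
The encoding of a dictionary is declared by the SET line of its affix file, so the conversion can be sketched as: read the declared name, decode with it, and write the result back as UTF-8. The Java code below is a rough sketch with hypothetical names (AffEncoding and its methods). It assumes ISO-8859-1 when no SET line is present, and it assumes the declared name is one that Java's Charset.forName recognizes, which may not hold for every name used in real dictionaries.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the hunspell-merge sources.
class AffEncoding {
    private static final Pattern SET_LINE = Pattern.compile("(?m)^SET[ \\t]+(\\S+)");

    /** Returns the encoding named on the SET line of an affix file. */
    static String declaredEncoding(byte[] aff) {
        // The SET line is plain ASCII, so a single-byte decoding is enough to find it.
        String ascii = new String(aff, StandardCharsets.ISO_8859_1);
        Matcher matcher = SET_LINE.matcher(ascii);
        return matcher.find() ? matcher.group(1) : "ISO-8859-1"; // fallback is an assumption
    }

    /** Converts file content (for example a .dic file) from the given encoding to UTF-8. */
    static byte[] recode(byte[] data, String encoding) {
        return new String(data, Charset.forName(encoding)).getBytes(StandardCharsets.UTF_8);
    }

    /** Converts an affix file to UTF-8 and updates its SET line to match. */
    static byte[] recodeAff(byte[] aff) {
        String text = new String(aff, Charset.forName(declaredEncoding(aff)));
        text = SET_LINE.matcher(text).replaceFirst("SET UTF-8");
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] aff = "SET ISO-8859-1\nTRY \u00e9\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.print(new String(recodeAff(aff), StandardCharsets.UTF_8));
    }
}
```

The dictionary file has no encoding line of its own, so it is decoded with the name found in the matching affix file.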

Beyond that, there were no obvious problems.
The program loads the dictionaries, merges them while ignoring duplicate words, and then discards any affix flags that are no longer used.
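
The merge step can be sketched in Java roughly as follows. This is a simplified illustration with hypothetical names (DictionaryMerger and its methods): each dictionary is modeled as a map from word to its already renumbered flags, and for a duplicate word the first occurrence is kept, which is my assumption rather than a statement about how the utility resolves duplicates.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not taken from the hunspell-merge sources.
class DictionaryMerger {

    /** Merges word lists in order; a word already present is not added again. */
    static Map<String, Set<Integer>> merge(List<Map<String, Set<Integer>>> dictionaries) {
        Map<String, Set<Integer>> merged = new LinkedHashMap<>();
        for (Map<String, Set<Integer>> dictionary : dictionaries) {
            for (Map.Entry<String, Set<Integer>> entry : dictionary.entrySet()) {
                merged.putIfAbsent(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }

    /** Collects the flags that at least one merged word still refers to. */
    static Set<Integer> usedFlags(Map<String, Set<Integer>> merged) {
        Set<Integer> used = new TreeSet<>();
        for (Set<Integer> flags : merged.values()) {
            used.addAll(flags);
        }
        return used;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> first = Map.of("cat", Set.of(1));
        Map<String, Set<Integer>> second = Map.of("cat", Set.of(2), "dog", Set.of(3));
        Map<String, Set<Integer>> merged = merge(List.of(first, second));
        System.out.println(merged.keySet() + " use flags " + usedFlags(merged));
    }
}
```

Affix rules whose flag is not in the set returned by usedFlags can then be left out of the resulting affix file. The sketch only looks at flags attached to words; rules that refer to other flags themselves would need extra care.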

I will not go into the implementation of individual procedures in this article, since I have posted the source code for those who are interested at the link below.
A build script for Ant is included as well.

code.google.com/p/hunspell-merge

How to work

So far I have tested the utility on Ubuntu and Windows 7 with the source file types listed in the Task section.
The utility requires the Java Runtime Environment (JRE).

Download HunspellMerge.jar and the launcher file for your OS (Linux or Windows) from the project page. On Linux, do not forget to make the launcher executable.
The utility can also be started with Java Web Start; a launch file for it is provided as well.





The utility needs a set of source dictionary files to work with. By default they can be placed in the dictionaries subfolder of the utility's working folder.

Additional dictionaries can be downloaded from the links located here
code.google.com/p/hunspell-merge/wiki/OnlineDictionaries

After the utility starts, or after a new source folder is selected, the dictionaries found there are shown in the list.
The user only needs to select several dictionaries (holding Ctrl or Shift), choose the destination folder, specify the language name and, when exporting to XPInstall format, adjust the dictionary description.

Using the results

Firefox
Output format: XPInstall
Copy the file path or drag the file into the browser's address bar.
Press Enter and install the extension.

Miranda / MirandaNG
Output format: Dictionary
Copy the files (*.aff, *.dic) to the Dictionaries folder in the program directory. Restart Miranda.

Plans

Since the utility was written in a couple of evenings, there was no time to test and support the extended directives of the affix file. This is planned, but some directives for one language may conflict with the same directives for another, so for now only three main directive types are supported: suffixes (SFX), prefixes (PFX) and replacements (REP).

Add a Russian translation of the interface.

It would also be nice to write documentation and improve the project page on Google Code.

Conclusion

I would love to hear which other programs such merged dictionaries work in.
I realize that the program is not without flaws, so I will be glad to hear comments and suggestions for improving it.

Thank you for your interest.

UPD: As it turned out, the utility is also useful for merging dictionaries of the same language that come from different sources or cover different fields, for example technical and economic ones.
