Digital Dictionary from A to Z
In my view, one of the most useful programs on a PC or a smartphone is an electronic dictionary. In the ancient times when I was learning a foreign language, every word had to be looked up in a paper dictionary. I performed this trivial operation hundreds of times, and some pesky words I had to look up again and again, because I had managed to forget their meaning in the meantime. How annoying that was! Nowadays it is just whoosh, and the translation is on the screen before my eyes. There is even a search history, in case the word you are looking up has not made it from short-term to long-term memory.
Let's create an electronic dictionary for the StarDict / GoldenDict programs on our own. It may take many man-hours or only a few, depending on the quality of the source material.
Step One: OCR
Unlike in mountaineering, the hardest step in digitizing a dictionary is the first one, not the last. If you have to OCR a paper dictionary with faded pages, tiny print, various traces of careless use, or an exotic language, even FineReader will not help much. On some pages, the difference in time between manual typing and OCR with error correction is negligible.
I advise you to save everything in plain text files, since advanced search and error correction, tagging, sorting, conversion and other operations on a body of text are practically impossible to perform on a binary file.
At this step, it is important to determine the structure of the dictionary entries. In the simplest case, there will be only two fields: a key and a value. This is sufficient, but if you need to distinguish the various elements of an entry, then you need to mark up all such elements in a consistent way.
It's time to talk a bit about formats. There are many formats of electronic dictionaries, here is a list of them.
We will not analyze all formats here, as most of them are proprietary. We are interested in open standards and open source software.
Dictd
Having emerged in an era when TCP/IP network protocols multiplied freely, dictd is now of archaeological interest only. This client-server protocol, which uses TCP port 2628, is defined in RFC 2229.
The source file for the dictionary is formatted as follows.
:headword: definition
For example, a dictionary like this:
:catalysis: "increase in the rate of a chemical reaction due to the participation of an additional substance called a catalyst, which is not consumed in the catalyzed reaction and can continue to act repeatedly.
" <a href="is.gd/v6a22Q">ref</a>.
:deconstruction:
:rendered: eg. "rendered irrelevant."
:reading: cf. 'reading of'
:minor: a minor reading.
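A source file in this layout can be generated from key/value pairs with a few lines of Python. This is only a sketch: the function name format_jargon_entries is made up for illustration, and it assumes that headwords contain neither colons nor line breaks.

```python
def format_jargon_entries(entries):
    """Render (headword, definition) pairs in the ':headword: definition' layout."""
    lines = []
    for headword, definition in entries:
        if ":" in headword or "\n" in headword:
            raise ValueError(f"unsupported headword: {headword!r}")
        # An empty definition leaves a bare ':headword:' line,
        # like the ':deconstruction:' entry in the example above.
        lines.append(f":{headword}: {definition}".rstrip())
    return "\n".join(lines) + "\n"


sample = [
    ("rendered", 'eg. "rendered irrelevant."'),
    ("deconstruction", ""),
]
print(format_jargon_entries(sample), end="")
```

The resulting text can be saved as mydict.txt and fed to the formatting command.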
The finished dictionary file is created with the dictfmt command.
dictfmt --utf8 -s "Long dictionary name" -j dict-name < mydict.txt
As a result, two files are generated: dict-name.index and dict-name.dict. The first is obviously an index file and needs no further processing; the second can be compressed with the dictzip command, which compresses the *.dict file in gzip format. The question immediately arises: why is it needed when there is ordinary gzip?
The point is that dictzip uses extra bytes in the header of the archive file to provide pseudo-random access to the file.
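The idea behind pseudo-random access can be illustrated in Python. This is a simplified sketch, not the actual dictzip layout: here every chunk is an independent zlib stream and the table of chunk sizes lives in a Python list, whereas dictzip keeps its table in the gzip header. The names compress_chunks, read_range and CHUNK_SIZE are hypothetical.

```python
import zlib

CHUNK_SIZE = 1024  # hypothetical chunk length chosen for this sketch


def compress_chunks(data, chunk_size=CHUNK_SIZE):
    """Compress data in independent chunks; return (blob, compressed_sizes)."""
    parts = [zlib.compress(data[i:i + chunk_size])
             for i in range(0, len(data), chunk_size)]
    return b"".join(parts), [len(p) for p in parts]


def read_range(blob, sizes, offset, length, chunk_size=CHUNK_SIZE):
    """Return data[offset:offset + length], decompressing only the chunks it touches."""
    if length <= 0:
        return b""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    start = sum(sizes[:first])  # where chunk `first` begins inside blob
    out = bytearray()
    for size in sizes[first:last + 1]:
        out += zlib.decompress(blob[start:start + size])
        start += size
    skip = offset - first * chunk_size
    return bytes(out[skip:skip + length])


data = b"".join(b"entry %05d\n" % i for i in range(1000))
blob, sizes = compress_chunks(data)
print(read_range(blob, sizes, 5000, 12) == data[5000:5012])
```

An index that stores the offset and length of each entry is then enough to fetch a single definition without unpacking the whole file.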
Finally, the files are placed in the appropriate directory, for example /usr/lib/dict, the dictd service is restarted, and voilà. The search syntax is simple, just type
dict WORD
Browsing dictd links is like a safari through the Internet of the 90s, alive and still kicking!
Sdict
A bold attempt by Alexei Semenov to change the world for the better with Perl magic, back when Microsoft was not yet courting Linux and the open source community and ABBYY Lingvo was the main source of dictionaries.
The header of the source dictionary file:
<header>
title = Sample 1 test dictionary - dictionary name;
copyright = GNU Public License - copyright information;
version = 0.1 - version;
w_lang = en - language for words;
a_lang = fi - language for articles. For further information
about language codes refer 'C:\Sdict\share\doc\iso639.htm' file;
# charset = ... - use if your source file is not in UTF-8 encoding.
</header>
The body is formatted as follows:
word___article
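The body layout is simple enough to read and write with a short script. The sketch below covers only the body lines, not the header; the helper names format_body and parse_body are made up, and it assumes that one entry occupies exactly one line.

```python
SEPARATOR = "___"  # three underscores, as in the body line shown above


def format_body(entries):
    """Turn (word, article) pairs into 'word___article' lines."""
    lines = []
    for word, article in entries:
        if SEPARATOR in word or "\n" in word or "\n" in article:
            raise ValueError(f"cannot store entry: {word!r}")
        lines.append(f"{word}{SEPARATOR}{article}")
    return "\n".join(lines) + "\n"


def parse_body(text):
    """Split 'word___article' lines back into (word, article) pairs."""
    pairs = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        # Split on the first separator only; the article may contain underscores.
        word, sep, article = line.partition(SEPARATOR)
        if not sep:
            raise ValueError(f"no separator in line: {line!r}")
        pairs.append((word, article))
    return pairs


body = format_body([("cat", "kissa"), ("dog", "koira")])
print(body, end="")
print(parse_body(body))
```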
You can even download a version for Symbian OS, if you need one. The project is no longer alive, and even the dictionaries themselves can only be retrieved from the Wayback Machine.
XDXF
That's enough archaeology; let's move on to dictionary formats and programs suitable for real-life use.
XDXF is an XML format, with all the advantages and disadvantages that implies. The full format syntax and examples can be viewed here.
The skeleton of a dictionary file consists of two parts, meta_info and lexicon, and looks as follows.
<xdxf...><meta_info>
All information about the dictionary: title, author, etc.
</meta_info><lexicon><ar>entry 1</ar><ar>entry 2</ar><ar>entry 3</ar><ar>entry 4</ar>
...
</lexicon></xdxf>
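A skeleton like this can be produced with the Python standard library, which also takes care of escaping special characters. Treat the details as assumptions: the attributes lang_from, lang_to and format, and the child elements title, k and def follow my reading of the XDXF drafts and should be checked against the specification; build_xdxf is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_xdxf(title, entries, lang_from="ENG", lang_to="RUS"):
    """Build a minimal XDXF-like document from (key, definition) pairs."""
    # Attribute names are assumptions, see the note above.
    root = ET.Element("xdxf", {"lang_from": lang_from,
                               "lang_to": lang_to,
                               "format": "logical"})
    meta = ET.SubElement(root, "meta_info")
    ET.SubElement(meta, "title").text = title
    lexicon = ET.SubElement(root, "lexicon")
    for key, definition in entries:
        ar = ET.SubElement(lexicon, "ar")
        ET.SubElement(ar, "k").text = key             # headword
        ET.SubElement(ar, "def").text = definition    # entry body
    return ET.tostring(root, encoding="unicode")


xml_text = build_xdxf("Sample dictionary",
                      [("cat", "a small domesticated feline"),
                       ("R&D", "research & development")])
print(xml_text)
```

Building the tree through an XML library rather than by string concatenation means characters such as & and < in the entries cannot break the file.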
There is a huge number of dictionaries in this format. Its big advantage is that nothing needs to be converted further: GoldenDict recognizes XDXF files along with a large number of other supported formats.
TSV / StarDict
StarDict and its clones are not so much about a dictionary format as about quality software for viewing, converting and creating dictionaries.
To create an electronic dictionary with StarDict, a TSV file is enough, and that is what I chose for a digital copy of the Armenian-Russian dictionary.
Some formatting and layout of the dictionary file is possible, although it cannot compare with XDXF.
a 1\n2\n3
b 4\\5\n6
c 789
The format uses the \n sequence for a line break when an entry is divided into paragraphs, and \\ for a literal backslash.
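The escaping rules are easy to get wrong by hand, so here is a small Python sketch of them. It handles only the two sequences shown in the sample lines, \n and \\, and rejects tab characters inside a definition instead of guessing how they should be encoded; the function names are hypothetical.

```python
def escape_definition(text):
    """Encode a definition for one TSV line: backslash -> \\\\, newline -> \\n."""
    if "\t" in text:
        raise ValueError("tab characters are not handled in this sketch")
    # Escape backslashes first, so the backslash added for '\n' is not doubled.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def unescape_definition(text):
    """Decode the two escape sequences back into real characters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "\\n":
            out.append("\n")
            i += 2
        elif pair == "\\\\":
            out.append("\\")
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def make_line(word, definition):
    """Join a headword and its escaped definition with a tab character."""
    if "\t" in word or "\n" in word:
        raise ValueError(f"unsupported headword: {word!r}")
    return f"{word}\t{escape_definition(definition)}"


print(make_line("a", "1\n2\n3"))
print(make_line("b", "4\\5\n6"))
```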
Step Two: Adjustment
After the first step there will most likely be dozens, if not hundreds, of spelling, grammar and other errors, stray characters and other OCR artifacts.
The peculiarity of dictionaries is that spell checking is needed in two languages simultaneously. Even now, in 2018, surprisingly few text editors, or even office suites, can perform this simple task.
Not to start a holy war, but I recommend doing the text processing in Vim. If your favorite text editor handles it just as well, that's fine. In Vim a single command is enough
:setlocal spell spelllang=en,ru
to check spelling against two dictionaries at once, in this case English and Russian. Next, a list of pitfalls.
- Sorting works unpredictably for non-Latin locales, and especially badly where a single letter is written with more than one character, like the Armenian ու = ո + ւ. In such cases you have to sort the word list yourself with a simple Perl or other script.
- Pattern matching can also behave unexpectedly for some locales, even if the text and the console itself are in UTF-8.
- When digitizing a printed dictionary, you need to be prepared not only for digitization errors but also for errors in the printed dictionary itself. There may be a lot of them!
- If headwords are printed in capital letters, it may be worth converting them to lowercase when digitizing. Not all letters have uppercase forms; in fact, not all scripts even have uppercase.
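The sorting pitfall can be worked around with a custom sort key that treats a multi-character letter as a single unit. The sketch below is in Python rather than Perl. The four-letter alphabet fragment, with ու placed last, is an illustrative assumption and not an authoritative Armenian collation, and the sample words are arbitrary strings picked to show the difference.

```python
def make_sort_key(alphabet):
    """Build a sort key for an alphabet whose letters may span several code points."""
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        word = word.lower()
        ranks = []
        i = 0
        while i < len(word):
            # Try the longest candidate first, so a digraph wins over its parts.
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                # Unknown character: place it after every known letter.
                ranks.append(len(alphabet) + ord(word[i]))
                i += 1
        return ranks

    return key


fragment = ["ո", "պ", "ս", "ու"]  # assumed order, for illustration only
words = ["ուս", "ոս", "պու", "ոպ"]

print(len("ու"))                                    # number of code points in the digraph
print(sorted(words))                                # plain code point order
print(sorted(words, key=make_sort_key(fragment)))   # digraph-aware order
```

With a full alphabet list for the language in question, the same key function can be used to sort the whole word list before compiling the dictionary.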
Step Three: Compile the Dictionary
For the XDXF format, as already mentioned, this step is not required. Just put the file in the /usr/share/goldendict folder, where the program will pick it up.
For a TSV file, use the stardict-editor utility that comes with the StarDict toolkit.
The program outputs the following files, much like the ancient dictd.
- somedict.ifo
- somedict.idx or somedict.idx.gz
- somedict.dict or somedict.dict.dz
- somedict.syn (optional)
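To give an idea of what is inside these files, here is a rough Python sketch that builds the contents of the three mandatory ones in memory. It is based on my recollection of the StarDict format notes (a text .ifo file, an .idx file of NUL-terminated words each followed by a big-endian offset and size, and a .dict file of concatenated definitions), so every detail should be treated as an assumption; for real work, stardict-editor remains the tool to use. The function names are hypothetical.

```python
import struct


def stardict_sort_key(word):
    """Order words case-insensitively for ASCII letters, then by exact bytes."""
    raw = word.encode("utf-8")
    return (raw.lower(), raw)  # bytes.lower() touches ASCII letters only


def build_stardict(bookname, entries):
    """Return (ifo_text, idx_bytes, dict_bytes) for (word, definition) pairs."""
    idx = bytearray()
    data = bytearray()
    ordered = sorted(entries, key=lambda e: stardict_sort_key(e[0]))
    for word, definition in ordered:
        payload = definition.encode("utf-8")
        # Index record: NUL-terminated word, then 32-bit big-endian offset and size.
        idx += word.encode("utf-8") + b"\0"
        idx += struct.pack(">II", len(data), len(payload))
        data += payload
    ifo = (
        "StarDict's dict ifo file\n"
        "version=2.4.2\n"
        f"bookname={bookname}\n"
        f"wordcount={len(ordered)}\n"
        f"idxfilesize={len(idx)}\n"
        "sametypesequence=m\n"  # assumed: every definition is plain text
    )
    return ifo, bytes(idx), bytes(data)


ifo, idx, data = build_stardict("Sample", [("b", "two"), ("a", "one")])
print(ifo, end="")
print(len(idx), len(data))
```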
The files are copied to the /usr/share/stardict/dic directory, and that's it.
P.S. GoldenDict for Android has unexpectedly become a paid app, but the latest free version can still be found on the Internet.