palexisru December 6, 2015 at 04:04

Placing a hierarchy of language dictionaries in directories


Greetings, dear readers of habrahabr.ru.

There are no objective obstacles to teaching programming to children of preschool and school age.

The main historical obstacles at the moment are:
1. the use of only ANSI-encoded characters in statements and error messages
2. an insufficiently developed approach to using national languages

For example, the following hierarchy may be useful for Russia: English -> Russian -> Tatar

This article proposes a directory hierarchy, for use in applications or in the localization of a programming language, that supports any national language by inheriting from words encoded in ANSI.

To support programming in national languages (for example, the languages of the peoples of Russia or Eurasia, or any language on the planet that has a written form or uses unprintable words), a translation system based on national symbols and dictionaries is proposed. The system consists of a directory structure and an approach that makes it possible to write programs in one's native language and to convert program source code from a national language to ANSI and back, or into any other language for which a description exists. The description of an algorithm can thus be written in any language, relying on a hierarchy of languages.

The basic type (the universal ancestor) is draft, which contains only English words and the underscore; spaces are replaced by underscores. Other ANSI characters are replaced by their verbal descriptions: dot, comma, and so on. The draft language serves as the universal basis for translating words and expressions. All strings in the hosting program (the translator) must be represented in this encoding.
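As a rough illustration of the draft form, the following Python sketch converts an ASCII message into draft. Only the names dot and comma come from the article; the other character names, the function name to_draft, and the decision to join character names to words with underscores are assumptions made for the example.

```python
# Verbal names for punctuation. Only "dot" and "comma" appear in the article;
# the remaining names are hypothetical.
CHAR_NAMES = {
    ".": "dot",
    ",": "comma",
    "!": "exclamation",
    "?": "question",
    ":": "colon",
}


def to_draft(text):
    """Convert an ASCII message to a draft-style identifier (illustrative only)."""
    tokens = []
    word = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            word.append(ch)
            continue
        # Any non-letter ends the current word.
        if word:
            tokens.append("".join(word))
            word = []
        if ch.isspace() or ch == "_":
            continue  # spaces become the underscore separators added by join()
        if ch in CHAR_NAMES:
            tokens.append(CHAR_NAMES[ch])
        else:
            raise ValueError(f"no verbal name defined for {ch!r}")
    if word:
        tokens.append("".join(word))
    return "_".join(tokens)
```

For example, to_draft("File not found.") yields File_not_found_dot under these assumptions.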

The next type of language, used for messages, is ansi. It is the ancestor of languages that use an alphabet and may include any characters from the range 1-127 of the encoding table. It is the logical place to keep common English expressions. String constants at this level, and at every level other than draft, can include any characters in the encodings supported by XML: OEM, utf8, utf16, utf32. For each language, the writing direction can be specified:
- left to right, top to bottom (English, Russian, etc.; the default)
- right to left, top to bottom (Arabic, Hebrew)
- top to bottom, left to right (Japanese, Chinese)

Directory structure

At the top level, the dictionary directory structure uses designations of the continents to which the languages belong; the subdirectories carry the names of countries or languages.

The top-level directories are therefore limited to the following list:

culture/af - Africa - African cultures
culture/an - Antarctica - universal prototypes (universal "Antarctic" cultures)
culture/au - Australia - Australian cultures
culture/ea - Eurasia - Eurasian cultures
culture/na - North America - North American cultures
culture/sa - South America - South American cultures

The draft and ansi encodings are placed under the continent an (Antarctica) to distinguish them from the spoken dialects of English in the UK, the USA and other countries:
culture/an/draft
culture/an/ansi

In this description, culture denotes the directory that contains the hierarchy of languages. For a particular program, dictionaries are created in the subdirectories corresponding to the languages, in a file named after the application. In addition, common.xml files are created in the language directory for the most universal words. For example, for English, this would be

culture/ea/en/common.xml

Language Inheritance

For each language except draft, no more than one inherited language is specified. The draft language does not inherit from any language. The language from which a given language inherits is named in its lang.xml dictionary description file.

The entire chain of inherited languages can be displayed when viewing the source code or the output of the language preprocessor. This can be convenient, for example, when programs written in a national language that inherits from Russian are checked by a computer science teacher who does not know the national language well enough. In addition, program source code can be machine-translated from one national language to another within the same dictionary.
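Displaying the chain amounts to following the parent link of each language until a language with no parent is reached. The Python sketch below shows one way to do this; the PARENTS table is a hypothetical stand-in for the inherited-language entries of several lang.xml files, and the link from ansi to draft is an assumption, not something the article states.

```python
# Hypothetical parent links, keyed by directory relative to culture.
# None marks a language that inherits from nothing (draft).
PARENTS = {
    "ea/rus/tatar_ru": "ea/rus/ru_ansi",
    "ea/rus/ru_ansi": "an/ansi",
    "an/ansi": "an/draft",  # assumed link
    "an/draft": None,
}


def inheritance_chain(language, parents):
    """Return the languages from `language` up to the root of its chain."""
    chain = []
    seen = set()
    current = language
    while current is not None:
        if current in seen:
            raise ValueError(f"inheritance cycle at {current!r}")
        if current not in parents:
            raise KeyError(f"no description for {current!r}")
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return chain


if __name__ == "__main__":
    print(" -> ".join(inheritance_chain("ea/rus/tatar_ru", PARENTS)))
```

The cycle check is a defensive addition; the article does not say how a loop in the descriptions should be handled.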

For each language, there can be several inheritance chains that are independent of each other. For example, for Russian, chains such as ansi -> ru or draft -> ru are possible; they are stored in the directories:

culture/ea/rus/ru_ansi
culture/ea/rus/ru_draft

In addition, for multilingual countries it is possible to create a language directory in the country subdirectory:

culture/ea/rus/tatar_ru

where:
culture - the root directory of internationalization support
ea - Eurasia
rus - Russia
tatar_ru - Tatar dictionary with translation from the Russian language

Similarly, you can create a dialect based on a language:

culture/ea/eng/en_ansi
culture/na/usa/en_en - American English

File structure

The entry point to the dictionary description is the lang.xml file, which is present in each language directory. It contains a reference to the inherited language and the names of the common dictionary files loaded by default, and it may also describe other features, such as the encoding of dictionaries stored in OEM text files.

The language description is stored in the culture section of the lang.xml file:

        language name

        default code page name for text dictionaries

        directory of the inherited language, relative to the culture directory

            name of a file in the same directory, of type xml or txt

            name of a file in the same directory, of type xml or txt

The from section for draft remains empty.
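The following Python sketch shows how a lang.xml description might be read with the standard xml.etree.ElementTree module. Only the section names culture and from are given in the article; the element names name, codepage and include used here, the sample document, and the function parse_lang are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical lang.xml content; only "culture" and "from" are named in the article.
SAMPLE = """<culture>
  <name>Russian</name>
  <codepage>utf8</codepage>
  <from>an/ansi</from>
  <include>common.xml</include>
</culture>"""


def parse_lang(xml_text):
    """Extract the language description from the text of a lang.xml file."""
    root = ET.fromstring(xml_text)
    culture = root if root.tag == "culture" else root.find("culture")
    if culture is None:
        raise ValueError("no culture section found")

    def text_of(tag):
        # findtext returns None for a missing element and "" for an empty one.
        return (culture.findtext(tag) or "").strip() or None

    return {
        "name": text_of("name"),
        "codepage": text_of("codepage"),
        "parent": text_of("from"),  # None when the from section is empty (draft)
        "files": [e.text.strip() for e in culture.findall("include") if e.text],
    }
```

With this sketch, an empty from section, as prescribed for draft, is reported as a parent of None.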

A simple dictionary, consisting of words in the target language and their translations into the inherited language, can be stored in a text file, although XML files are preferable. In a text file, each line contains one pair: the word, a space, and its translation into the inherited language. Phrases can also be supported, in which case each phrase is enclosed in quotation marks and separated from its translation by a space.
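A text dictionary of this kind can be read line by line. The Python sketch below is one possible reading of the format; the function name parse_text_dictionary, the error handling for malformed lines, and keeping the first translation of a repeated word are assumptions (the last follows the conflict rule the article gives for dictionaries in general).

```python
import re

# A token is either a quoted phrase or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_text_dictionary(text):
    """Parse 'word translation' lines; quoted phrases may contain spaces."""
    entries = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        tokens = [
            m.group(1) if m.group(1) is not None else m.group(2)
            for m in _TOKEN.finditer(line)
        ]
        if len(tokens) != 2:
            raise ValueError(f"line {line_no}: expected a word and its translation")
        word, translation = tokens
        entries.setdefault(word, translation)  # the first translation is kept
    return entries
```

A line such as "пока не" until would then map the two-word phrase to a single translation.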

In an XML file, word translations are stored in the words section, and a dictionary file may refer to another file of the same dictionary in its include section. The XML format also makes it possible to attach properties to entries, for example properties related to keywords of a programming language.


       name of a file in the same directory, of type xml or txt

           word in the dictionary's language

            translation into the inherited language

            short description of the keyword in the target national language
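How a loader might merge a dictionary file with the files it includes can be sketched as follows in Python. The in-memory files mapping stands in for parsed XML files and is hypothetical, as is the function name load_dictionary. The sketch assumes that a file already loaded is skipped, that the first translation encountered for a word is kept, and that a file's own words are read before its includes.

```python
def load_dictionary(start, files):
    """Merge the words of `start` and everything it includes.

    `files` maps a file name to {"words": [(word, translation), ...],
    "include": [file names]} and stands in for parsed dictionary files.
    """
    merged = {}
    visited = set()

    def visit(name):
        if name in visited:
            return  # a repeated include is ignored
        visited.add(name)
        entry = files[name]
        for word, translation in entry.get("words", []):
            merged.setdefault(word, translation)  # the first translation wins
        for included in entry.get("include", []):
            visit(included)

    visit(start)
    return merged
```

Because visited files are remembered, two files that include each other do not cause an endless loop.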

Included XML files can in turn refer to other files, which allows a modular translation structure. A repeated inclusion of the same file is ignored; when translations conflict (different translations of the same phrase), the first translation is preferred. If a word has no translation at some link in the chain of inherited languages, the translation in the last language reached is taken as the correct one. When converting from a national language to the ANSI encoding, transliteration can be used for words that are missing from the dictionary.
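The rule for missing translations can be sketched in Python as a walk along the inheritance chain that stops at the first gap. The function name translate_down, the data layout and the sample words are illustrative assumptions; the transliteration fallback is left out.

```python
def translate_down(word, chain, dictionaries):
    """Translate `word` along `chain` (source language first).

    `dictionaries` maps a language to {word: translation into its parent}.
    Returns the last translation found and the language it is in.
    """
    current = word
    reached = chain[0]
    for language, parent in zip(chain, chain[1:]):
        table = dictionaries.get(language, {})
        if current not in table:
            break  # no translation at this link: keep what we have
        current = table[current]
        reached = parent
    return current, reached


if __name__ == "__main__":
    chain = ["tatar_ru", "ru_ansi", "ansi"]
    dictionaries = {
        "tatar_ru": {"әгәр": "если"},
        "ru_ansi": {"если": "if"},
    }
    print(translate_down("әгәр", chain, dictionaries))
```

Returning the language reached alongside the word lets the caller decide whether a fallback such as transliteration is still needed.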

If there are several programs

Different programs may require different translations. Accordingly, each program can have its own dictionary, named after the application, which is loaded first; the general dictionaries from the same directory (for example, common.xml) are loaded after it. The application is responsible for specifying the path to the dictionary directory, the language used, and the initial dictionary file, for example through a configuration file. Work with this modular directory structure can be implemented as a library.
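The loading order described here can be sketched in Python as a list of paths. The function name dictionary_load_order, its parameters, and the assumption that the application dictionary is an XML file named after the application are all illustrative.

```python
from pathlib import PurePosixPath


def dictionary_load_order(culture_root, language_dir, app_name,
                          common=("common.xml",)):
    """Return dictionary paths in loading order: application first, then common."""
    base = PurePosixPath(culture_root) / language_dir
    return [base / f"{app_name}.xml", *(base / name for name in common)]


if __name__ == "__main__":
    # "editor" is a hypothetical application name.
    for path in dictionary_load_order("culture", "ea/rus/ru_ansi", "editor"):
        print(path)
```

In a real application the three arguments would typically come from the configuration file mentioned above.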

The proposed directory structure does not take parameterizable strings into account, but it is transparent enough for creating localizations in many languages, for example in a Git repository.

Links:
habrahabr.ru/post/176243 - "National" programming languages
habrahabr.ru/post/136272 - What programming language should be the first when studying at school?
habrahabr.ru/post/20541 - About the internationalization of applications
habrahabr.ru/post/165705 - A few words about the internationalization of applications
habrahabr.ru/company/alconost/blog/173467 - How LinkedIn makes localization in 19 languages in 1 night
habrahabr.ru/post/267501 - Localization of Google Chrome extensions - you just need

programming languages with keywords not in English

in particular, an interesting link to www.robomind.net - robot learning environment in English and Dutch
