Honeyman October 28, 2009 at 02:19

International lambs

Although the world culture represented by Wikipedia and Paul McCartney assures us that "Mary had a little lamb", on one-eighth of the world's land people go on believing that, in fact, “Mary had a lamb”. So who did Mary actually have, and how do you write that in the different languages of the world? Let's try to find out (and also learn what the Japanese think about it) with the help of our beloved Python and its built-in gettext module for multilingual translation support.

Let's get started

To begin with, recall that the gettext library is used to translate programs written not only in Python but in many other languages. It lets us use phrase templates in our program that are translated through separate, independent translation files. In the program itself we still write text straight to the screen, to disk, to the logs or anywhere else, merely marking the translatable strings in a special way; gettext then takes these strings and a set of translation files and, if there is a translation file for the current language, substitutes the translated string.
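The lookup-with-fallback idea described above can be sketched in a few lines. This is only a toy model: the CATALOGS dictionary and the translate() helper are hypothetical names invented for illustration, and real gettext reads its catalogs from translation files rather than from a dict.

```python
# Hypothetical in-memory catalogs; real gettext loads them from translation files.
CATALOGS = {
    "ru": {"Mary": "Мэри", "lamb": "ягнёнок"},
}


def translate(text, language):
    """Return the translation for `language`, or the original text if there is none."""
    return CATALOGS.get(language, {}).get(text, text)


print(translate("Mary", "ru"))  # a translation exists
print(translate("Mary", "fr"))  # no catalog for this language: original text comes back
```

The important property is the fallback: a missing language or a missing string never breaks the program, it just leaves the text untranslated.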

In Python, the gettext mechanisms are accessed through the gettext module bundled with Python. So let's not confuse the gettext system as such (an entity external to Python and not at all required for it to work, although it does come with convenient utilities for handling gettext files) with the gettext module built into Python.

First, we will write a basic program (let's call it mary.py), which we will try to translate into different languages:

#!/usr/bin/python

name = _("Mary")
animal = _("lamb")

print _("%s had a little %s") % (name, animal)

When using the gettext module, it is customary to mark translatable strings with a call to the _() function. This function has not been defined yet (although nothing stops us from temporarily defining something like _ = lambda x: x), so for now the program will not even start... but we don't need it to yet.

You have probably guessed that we will now create a new text file of associations, in which we must remember to list every translatable string from the program? In our case there are only 3 such strings, but a serious program may have many more...

Translation Template: .pot

... you almost guessed right. We will create such a file, but we will take advantage of a nice feature of the gettext system: scanning source files for translatable strings. Since we prudently marked them with calls to the _() function before using gettext in earnest, the parser can now collect them quickly.

Since the gettext system is designed to be used from any programming language, it includes the xgettext program, which can generate a translation template from sources in quite a few languages: C, C++, Objective-C, C#, Java, Perl, Python, PHP, Lisp... But that is only if you were not too lazy to install the gettext package itself (“aptitude install gettext”, or whatever it is in your distribution). We, however, are writing in Python, which is self-sufficient when it comes to translating programs; so we will use the pygettext.py script (or pygettext on Unix systems) included with Python.
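To give a feel for what such an extractor does, here is a toy sketch built on Python 3's ast module. It is not how pygettext.py or xgettext are actually implemented; the extract_msgids() helper and the SAMPLE_SOURCE string are hypothetical, and only plain _("literal") calls are recognized.

```python
import ast

SAMPLE_SOURCE = '''name = _("Mary")
animal = _("lamb")
print(_("%s had a little %s") % (name, animal))
'''


def extract_msgids(source):
    """Return sorted (line number, msgid) pairs for every _("literal") call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "_"
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            found.append((node.lineno, node.args[0].value))
    return sorted(found)


for lineno, msgid in extract_msgids(SAMPLE_SOURCE):
    print("#: line %d  msgid %r" % (lineno, msgid))
```

Because the strings were marked with _() from the start, collecting them is a purely mechanical pass over the source.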

Run pygettext: pygettext mary.py. A messages.pot file appears in the same directory as our program, containing the following:

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2009-10-28 01:12+MSK\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: ENCODING\n"
"Generated-By: pygettext.py 1.5\n"

#: mary.py:6
msgid "Mary"
msgstr ""

#: mary.py:7
msgid "lamb"
msgstr ""

#: mary.py:10
msgid "%s had a little %s"
msgstr ""

What is this? It is a template for translating our entire program. If we have a large team of translators, we can hand this template to the translator for each target language, and each of them returns the template filled in for their language. By convention, templates have a .pot extension and filled-in files have a .po extension.

Translation File: .po

The syntax of the file is quite transparent: comments, copyright notices for the translation, and pairs of original strings and their translations. We will strip everything superfluous from the file, keeping only the "Content-Type:" line and the entries needed for the translation, set the encoding to UTF-8 and write in the translations:

msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "Mary"
msgstr "Мэри"

msgid "lamb"
msgstr "ягнёнок"

msgid "%s had a little %s"
msgstr "У %s был маленький %s"

In our case the file is small and simple; if it were more complicated, it would be more convenient to use a specialized .po editor such as Poedit, or the “specialized editor of everything”, Emacs.
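Since the format really is this simple, a file like ours can be read with very little code. The following Python 3 sketch is an illustration only: parse_simple_po() and SAMPLE_PO are hypothetical names, the parser ignores comments, plural forms and message contexts, and it assumes the quoted strings use escapes that Python's own string literals understand.

```python
import ast

SAMPLE_PO = r'''
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "Mary"
msgstr "Мэри"

msgid "lamb"
msgstr "ягнёнок"
'''


def parse_simple_po(text):
    """Collect msgid/msgstr pairs from a simple .po file into a dict."""
    entries = []   # each entry is a [msgid, msgstr] pair under construction
    field = None   # which half of the pair a continuation line extends
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("msgid "):
            entries.append([ast.literal_eval(line[6:]), ""])
            field = 0
        elif line.startswith("msgstr "):
            entries[-1][1] = ast.literal_eval(line[7:])
            field = 1
        elif line.startswith('"'):
            # A bare quoted string continues the previous msgid or msgstr.
            entries[-1][field] += ast.literal_eval(line)
    return {msgid: msgstr for msgid, msgstr in entries}


catalog = parse_simple_po(SAMPLE_PO)
print(sorted(catalog))
```

Note how the entry with the empty msgid carries the header, including the charset declaration, as its "translation".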

Compiled translation file: .mo

So, we have translated the strings in our program. In vain, by the way: gettext is aimed solely at translating complete sentences, and translating individual words and sentence templates with it is dangerous... (For example, gettext has no support at all for grammatical case or gender, and only limited support for the singular/plural distinction; so to substitute “Tanya” or “Sveta” for Mary, you would have to work out the right case for every possible use of the name.) Well, never mind: in our case this does not matter. Now we have a different task: preparing the translation file for use.
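As a hedged illustration of the safer approach, the Python 3 sketch below marks a whole sentence, uses named placeholders so a translator can reorder them freely, and goes through ngettext(), the module's plural-aware call. The describe_flock() helper is hypothetical, and gettext.NullTranslations is used only because it works without any catalog: it returns the English text, choosing the singular form when the count is 1.

```python
import gettext

# "No catalog" fallback object: returns the original English strings.
fallback = gettext.NullTranslations()


def describe_flock(owner, count):
    """Build the sentence from one complete, plural-aware template."""
    template = fallback.ngettext(
        "%(owner)s had %(count)d little lamb",
        "%(owner)s had %(count)d little lambs",
        count,
    )
    return template % {"owner": owner, "count": count}


print(describe_flock("Mary", 1))
print(describe_flock("Mary", 3))
```

This does not solve the case and gender problem for substituted names, but it at least keeps word order and number agreement in the translator's hands.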

Using the source text file directly would be inefficient (for programs with a lot of translated text), so the gettext system works with files compiled into a special binary format. To compile, we can use either the msgfmt tool from the gettext package or the msgfmt.py tool shipped with Python (in Debian-like distributions it is part of the python2.5-examples package).

We'll use the second one:

msgfmt.py mary.po

Sure enough, a mary.mo file has appeared. Unlike mary.po, it is clearly not intended for manual editing.
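To demystify the binary file a little, here is a Python 3 sketch that packs a catalog into the GNU .mo layout (a 7-integer header, a table of original strings, a table of translations, then the NUL-terminated strings themselves) and reads it back with the standard gettext.GNUTranslations class. The build_mo() helper is a hypothetical name, not part of msgfmt.py or of the gettext module; the sketch omits the optional hash table and handles neither plurals nor contexts.

```python
import gettext
import io
import struct


def build_mo(messages):
    """Pack a {msgid: msgstr} dict into little-endian GNU .mo bytes."""
    # The format expects the original strings in sorted order.
    encoded = sorted((k.encode("utf-8"), v.encode("utf-8"))
                     for k, v in messages.items())
    ids = b""
    strs = b""
    spans = []  # (id length, id offset, str length, str offset)
    for msgid, msgstr in encoded:
        spans.append((len(msgid), len(ids), len(msgstr), len(strs)))
        ids += msgid + b"\0"
        strs += msgstr + b"\0"
    count = len(encoded)
    key_table = 7 * 4                   # right after the header
    value_table = key_table + 8 * count
    key_start = value_table + 8 * count
    value_start = key_start + len(ids)
    # Header: magic, version, entry count, both table offsets, empty hash table.
    out = struct.pack("<7I", 0x950412DE, 0, count, key_table, value_table, 0, 0)
    for id_len, id_off, _str_len, _str_off in spans:
        out += struct.pack("<2I", id_len, key_start + id_off)
    for _id_len, _id_off, str_len, str_off in spans:
        out += struct.pack("<2I", str_len, value_start + str_off)
    return out + ids + strs


catalog = {
    "": "Content-Type: text/plain; charset=UTF-8\n",
    "Mary": "Мэри",
    "lamb": "ягнёнок",
    "%s had a little %s": "У %s был маленький %s",
}
mo_bytes = build_mo(catalog)
translations = gettext.GNUTranslations(io.BytesIO(mo_bytes))
sentence = translations.gettext("%s had a little %s") % (
    translations.gettext("Mary"), translations.gettext("lamb"))
print(sentence)
```

For real projects msgfmt or msgfmt.py remains the right tool; the point here is only that the .mo file is a sorted, offset-indexed table that can be searched without parsing any text.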

Directory structure and program launch

If we were preparing the program for installation into system directories, we would create a hierarchy like this (on Debian Linux): the system directory /usr/share/locale, in it subdirectories for the different languages (ru, en, etc.), in each of them an LC_MESSAGES subdirectory, and in there a file such as mary.mo (with as unique a name as possible, so as not to clash with other programs). But for our tutorial we simply make a locale subdirectory in our own directory, create the ru/LC_MESSAGES subdirectories in it, and put mary.mo into the last one.
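The layout above can be checked from Python itself. In this Python 3 sketch, place_catalog() is a hypothetical helper that creates the directories inside a temporary folder and drops an empty placeholder file there, while gettext.find() is the standard function that resolves a domain and a list of languages to a .mo path (it only checks that the file exists, so the placeholder is enough).

```python
import gettext
import os
import tempfile


def place_catalog(root, language, domain, mo_bytes=b""):
    """Write <root>/locale/<language>/LC_MESSAGES/<domain>.mo; return the locale dir."""
    localedir = os.path.join(root, "locale")
    target_dir = os.path.join(localedir, language, "LC_MESSAGES")
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, domain + ".mo"), "wb") as handle:
        handle.write(mo_bytes)
    return localedir


with tempfile.TemporaryDirectory() as root:
    localedir = place_catalog(root, "ru", "mary")
    # A full locale name is expanded to less specific variants, ending with "ru".
    found = gettext.find("mary", localedir, languages=["ru_RU.UTF-8"])
    relative = os.path.relpath(found, root)
    # No Japanese catalog has been placed, so nothing is found for it.
    missing = gettext.find("mary", localedir, languages=["ja"])

print(relative)
```

This is why a single ru directory serves users whose locale is ru_RU.UTF-8: the lookup falls back from the most specific name to the bare language code.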

Now finally add gettext support to our program:

#!/usr/bin/python
import gettext

gettext.install('mary', './locale', unicode=True)

name = _("Mary")
animal = _("lamb")

print _("%s had a little %s") % (name, animal)

What has changed? We imported the gettext module (well, that is obvious). We also installed into the program's global namespace the _() function which, to translate a string, looks in the ./locale subdirectory (second argument) for the directory of our current locale (that same ru subdirectory), and in its LC_MESSAGES subdirectory looks up the mary.mo file holding the translation of the mary program (first argument), returning the result as a unicode string (third argument).

What does "installed" mean? It means that after this call we can import the other modules of our program, and the _() function will already be defined in them.
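A small Python 3 sketch of what "installed" means in practice: gettext.install() places _ into the builtins namespace, which is why every module sees it without importing anything. An empty temporary directory stands in for ./locale here (an assumption made so the example runs anywhere), so no catalog is found and the strings come back untranslated. The unicode=True argument from the listing above belongs to Python 2 and is not used in Python 3.

```python
import builtins
import gettext
import tempfile

# An empty directory means no mary.mo is found, so gettext.install()
# falls back to an identity translation instead of raising an error.
with tempfile.TemporaryDirectory() as empty_localedir:
    gettext.install("mary", empty_localedir)

installed_in_builtins = hasattr(builtins, "_")

# No local definition of _ exists in this file: the name resolves via builtins.
message = _("%s had a little %s") % (_("Mary"), _("lamb"))
print(message)
```

The same fallback is what keeps the program usable for users whose language has no translation yet.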

We launch our program... Yes, something like this:

1:/tmp/mary> ./mary.py
У Мэри был маленький ягнёнок

Bonus

According to Google Translate, the .po file for Japanese will look something like this:

msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "Mary"
msgstr "メアリー"

msgid "lamb"
msgstr "子羊"

msgid "%s had a little %s"
msgstr "%sの%sいた"

And for proper support of Japanese (in addition to Russian), we will have to change the last line of the code to:

print (_("%s had a little %s") % (name, animal)).encode('UTF-8')

Let's check that it works:

1:/tmp/mary> LANG=ja_JP.UTF-8 ./mary.py
メアリーの子羊いた
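Setting LANG works because, when no language list is passed explicitly, Python's gettext module consults the environment variables LANGUAGE, LC_ALL, LC_MESSAGES and LANG, in that order, and uses the first one that is set. The Python 3 sketch below mimics that order; languages_from_environment() is a hypothetical helper written for illustration, not a function of the gettext module.

```python
def languages_from_environment(environ):
    """Mimic the order in which gettext consults the locale variables."""
    for variable in ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG"):
        value = environ.get(variable)
        if value:
            # LANGUAGE may hold several colon-separated preferences.
            return value.split(":")
    return ["C"]


print(languages_from_environment({"LANG": "ja_JP.UTF-8"}))
print(languages_from_environment({"LANGUAGE": "ru:en", "LANG": "ja_JP.UTF-8"}))
```

If the environment should not decide, the language can also be chosen in code by passing a languages list to gettext.translation().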
