How not to be dishonored with localization and internationalization

    On the subject of this article, I got a story from a very recent past. I went to the product page, called Supersite, the company We Will Not Poke LLC (but people from the domainer industry will find out). I came from my cozy office in Latvia and was surprised to find the following picture:


    And, to put it mildly, I was embarrassed by the currency in which the cost of services was indicated. After all, the second half of 2018 is in the yard, and the Latvian lat officially ceased to exist in January 2014 and was replaced by the euro. And for 4 years the company not mentioned above did not bother to conduct an audit of the locales used.

    Under the cut, I will tell you what to remember when internationalizing and localizing your product and where to draw data for periodic revisions.

    Definitions


    It will be logical to understand at the beginning what is what and agree on what we will call what we will do in the future. So…

    • Internationalization (internationalization, i18n) - preparation of a software product for working with various languages ​​and local differences without the need for a file to be refined during their implementation. Many letters, therefore I will explain with examples. To draw your site so that it works for left-to-right and right-to-left spellings without additional intervention from the layout file - internationalization. Replacing the entire text hardcode with language variables is the same. Teach the product to format dates - again it’s the same. By the way, the number 18 in i18n means only the number of letters between the first and the last in the word internationalization. I'm lazy For convenience, I will continue to use i18n.
    • Localization (localization, L10n) - adaptation of internationalized software to the standards of a specific region (locale). Those. when you give a list of language variables for translation into Bushman English and specify the format of numbers for Indians, this is L10n.
    • Locale - a set of parameters that defines the language and specific user interface settings that correspond to the habits of users from a specific region.

    Why do you need it?


    Good question. Many modern programming languages ​​contain built-in algorithms for basic localization (formatting dates, numbers, currencies). And if you don’t care, you are fully prepared to trust the technology you have chosen, and you don’t care about your users. Your i18n vision for your product does not go beyond this, the only possible reason is idle curiosity and general development.

    However, the devil is in the details. And sometimes these little things do not pay attention, which can be annoying to users. If you already have a solid experience in i18n, most likely you will find little new in this article (unless some real life examples). In this case, I will be grateful if you add comments from your experience (and correct me if I am mistaken about something). The rest, I hope, will find food for thought.

    What aspects does the locale include?


    Often the locale is defined as a combination of language and country. These parameters are enough to set the set of linguistic nuances and other parameters used in the region. Some specific tasks may require more complex divisions (for example, tax jurisdictions of some countries). In this case, the third parameter is also specified - variation (for example, for a specific region, operating system, etc.). Also, much depends on how close you want to be to your user (for example, in the Philippines there are 12 indigenous languages ​​with more than a million speakers, they would be pleased).

    So what does the locale include?

    Popular and obvious


    Formatting most of the parameters from this group provides, perhaps, the majority of modern programming languages. Although it is better to keep a close eye on them just in case. Or at least in time to update the versions of the corresponding libraries.

    • Translation - everything is clear, no programming language will do this for you. When preparing a product for translation, keep in mind a simple rule: the desired minimum unit for a language variable is a sentence (as far as possible). And the whole phrase is better. This may not be obvious if the system architect knows only one language or two grammatically close ones (I have come across such personal experiences, and as a result, the developers had to redo and rewrite a large number of text messages in their code). But language in a broad sense is a reflection of the lifestyle and cultural characteristics of a certain people.

      For example, all (or many) of you know that there is a strict order of parts of speech in English. As far as I know, in Chinese too, by the way. But in the Russian word order can either not matter at all, or change the meaning (“you are very clever” sounds like praise, and “you are very clever” - as a threat). In Arabic, there are differences when dealing with men and women, in Japanese - between social strata. Depending on how important this or that audience is for you, you should either study these details in detail with the native speaker or ignore them.
    • The date and time for the most part differ precisely in the formatting of the date. The difference in time format is basically 12 or 24 hour format. But with the date options are much more. Date formats are often assumed to be multiple. Day and month; day, month, year in numeric format; day, month, year in an expanded format. And here the number of options is growing rapidly. Somewhere the separator is a point, somewhere - a slash, somewhere in abbreviated formats, the first is the day, about a month. Even more fun deal with the extended format. Take for example the date of my birth (I'm a shy, yeah). So, on September 5, 1986, a man was born, who was blotting bytes with this opus. Let's go to the locations. Two English speaking countries to begin with.
      • USA - September 5, 1986.
      • Great Britain - 5 September 1986.

      And this is just the beginning. In English, there are no cases, but even at the beginning of our journey, en_US and en_UK are different. Look at the languages ​​of the countries closer?
      • Russia - September 5, 1986. So the cases appeared. And here surprises may begin, because The standard means of formatting the date of your programming language may not be aware of the nominative and genitive cases.
      • Latvia - and you just want to name the date or say that something happened on this date? In Russian (today) September 5 and (born) September 5 - everything is genitive. But in Latvian the simple name of the date is 1986 gada 5.septembris. And if “I was born” - 1986 gada 5.septembrī. The year is put first, the number is in the local case (a rough translation is “on 5 September”). And after all the ordinal numbers in the Latvian put a point.

      Target the whole world? Think about which date formats to use. Probably, it is better not to get involved with extended ones, the built-in formatting functions hardly take into account all the above-mentioned subtleties. And I walked through only 4 of the 195 member countries and UN observers.
    • The format of numbers also holds a lot of confusion. I know only the separator of the integer and fractional parts (usually a dot or comma) and the separators inside the integer portion (I precisely met the options "no separator", comma, space, I also allow the use of a dot and apostrophe). The role is also played by the position where delimiters are placed. Let's say we (and not only) are accustomed to put separators every 3 positions (thousands, millions, etc.). But the above-mentioned residents of India and nearby countries live their own lives. The first separator in the integer part (counting from decimal) comes after 3 positions (thousands), and then every two: lakh (100 thousand), crore (10 million), and so on. Thus, our 42 million in the Indian recording system will look like 4,20,00,000. And there they often measure the annual salary in lakhs of rupees. However, in the issue of formatting numbers with a high degree of reliability, you can rely on a programming language.
    • A currency format is essentially a formatted number, prepped with the prefix or suffix of a symbol or currency code. The main thing is to make sure that there are no adventures, as at the very beginning of the article. At the moment, especially for EU countries, because some may join the euro zone.
    • Writing actually covers a little more than just writing the whole text in a different direction in some languages. This is a piece of work for the coder or UI designer. When localizing an interface created for languages ​​from left to right, it is often completely mirrored for languages ​​from right to left (say, the logo and side bar from the site menu will be on the right).

    Less obvious


    Some data related to i18n are used quite often, but sometimes with small omissions. There a programming language will not help you, you have to work with pens.

    • Postal code. Guess how many countries do not use the postal code at all? According to an article on the Great and All-Knowing, 66! In fairness, I note that 3 of them use a similar to the postal code system, which allows you to encode up to the street / group of houses / house. But 63 more remain, in which either the postal code is not used at all, or its introduction was planned or planned. And this is almost a third of the world. Now remember how many sites you have met, where the postal code is a required field? And nothing can be done about it. Although the correct approach would be to make it mandatory only for those countries where it generally is. In addition, if you wish, you can check the input to the standards of the selected country. Fortunately, this information is available (including the link above).
    • Region. As options - the state, the region ... Another field, which they like to make mandatory, not taking into account the real situation with postal addressing in the country. Yes, even in the smallest countries there is some kind of administrative division ( details on the same wiki ), but it is not always necessary to make the field mandatory.
    • Telephone number. It consists of a country code and a national identifier. And if the list of country codes is not a problem, then there may be nuances with the validation of the national identifier. For example, what is the minimum length of the number sewn into your verification? But the real minimum number of the number is 4 digits. Yes, this concerns only two miniature territories, one of which is fifth in place from the population, the other is somewhere nearby. But here I want to focus more on validity than on the chance to get as a user one of about 1600-1700 Niue residents . The link can get an idea of ​​the length of national identifiers by country.
    • Name and appeal (title). Here many have the usual averaging. Fields for name and surname plus respectful treatment. As with the other points of this section, it all depends on how much “your” you want to be. If in general terms, the name and surname are obligatory (although in rare cases, the law may establish only one of these). For convenience, you can make fields for conversion, other names and suffixes (all these "junior", "third"). If you go into particular, the rules of writing can vary greatly from culture to culture, from language to language.
      • Russia - we all know that the full name consists of the last name, first name and patronymic. In the language, as in the country itself, the order is very conditional, therefore the surname in the address may be in first or last place, and the middle name may be omitted. Optionally, Mr. / Mrs (Mr / Madam) may be added at the start of the appeal.
      • USA - the full name often consists of the proper name (first name, “first” name), intermediate name or middle name, or initial and last name. It is recorded as standard in this order, intermediate names can be omitted. Optionally appeals can be added (most popular: Mr, Ms, Mrs, Dr).
      • Latvia - the full name consists of the name and surname, always in that order. The middle name as such exists only in the birth certificate, it is not used in other documents. There is a form of courteous treatment kungs / kundze (analogue of the Russian mister / madam), which is put after the last name (that is, at the end, not at the beginning, as in the previous versions). The surname is recorded in the genitive payment.
      • China - in the original Chinese record, the last name always goes first, then the name. There is a polite form of addressing that joins the family name (merges with it, and not with a single word). My last name Vasiliskov in the Chinese record will look like. And the Chinese equivalent of "Mr. Vasiliskov" - 瓦西里斯科夫 先生.
        Pampering with chinese
        Не относится к теме статьи, но может принести много лулзов. Если взять слово, перевести гугл переводчиком на китайский, а полученный результат разбить на 1-2 иероглифа и переводить назад, можно очень увлекательно провести время. Скажем, 瓦西里 он переводит как «Василий», 斯科夫 как «бухта», 科夫 как «Краков». Но тайный смысл древних знаний можно раскрыть на обычных словах. Скажем, телефон переводится в 电话. При этом 电 — «электричество», 话 — «слова». Другие слова с электричеством: 电池 — аккумулятор (池 — бассейн), 电脑 — компьютер (脑 — мозг), 电影 — фильм (影 — тень). С настоящими китайцами по этой части не сверялся, но время провести таким образом можно очень неплохо. А вообще интересный язык. После латышского и польского надо бы заняться…
      • Philippines - here the American and Spanish name spelling systems have historically mixed. Taken from the times of the Spanish colonization, the tradition to record the names of the mother and father was mixed with the American to give intermediate names. In the current version, the given name at birth is written in the “first name” column, the surname of the father becomes the child’s last name, the mother’s maiden name is the intermediate name.

      As you can see, the adaptation of the system for all possible recording options can make it too complex. But if one of the important markets for your product is in a particular country, you will have to try.

    More rare options


    Aspects of i18n from this category most of you are unlikely to ever need. But it may still be useful to have them in mind.

    • System of units. Are your users accustomed to meters, kilograms, liters and degrees Celsius? Or feet, pounds, gallons and Kelvin degrees? I myself have not yet been to the United States, but those who have been told that upon arrival there you find yourself in Narnia, the wondrous world of "non-system" units. And after a certain time spent there, you start to forget the system.
    • Paper size Partly related to the previous one and very useful if you generate some invoices, paper forms and something else that could potentially be printed and put in a folder for notes or filed with someone. Most countries are accustomed to the A4 format. But the United States, for example, is widely used for Letter (8 1⁄2 x 11 inches, 216 x 279 mm) and Legal (8 1⁄2 x 14 inches, 216 x 356 mm) formats.
    • Combined String Rules. One aspect that actually can often become useful, and which is difficult to implement. Under the rules of the combined lines, I mean those cases when you need to form a piece of text, and not just give a language variable. Examples include, but are not limited to:
      • declination of words related to the number (in your basket 3 products / 5 products);
      • the formation of the full name from the example above;
      • proper use of the grammatical gender in languages ​​where it exists (Dear Mr. Ivanov / Dear Ms. Ivanova), etc.

      How to deal with them? If you target a limited number of locales or expand gradually, you can think of the architecture for such slippery places. But ideally it would be neat and neutral to bypass them.

    Where to get information?


    In the text of the article I gave links to Wikipedia, but we all know that using it as a serious source of knowledge is not worth it. Fortunately, there is a Common Locale Data Repository project ( Common Locale Data Repository ) supported by the Unicode Consortium . Not only does it contain an incredible amount of aspects and locale settings and is regularly updated by the community, the data is available for free download in XML format, which allows for the correct development of the architecture to regularly and painlessly update the standards.

    I touched on only the main aspects in my opinion. If you think that I missed something, write in the comments, add. At the same time share your experience with localized products.

    Thank you for staying with us. Do you like our articles? Want to see more interesting materials? Support us by placing an order or recommending to friends, 30% discount for Habr users on a unique analogue of the entry-level servers that we invented for you: The whole truth about VPS (KVM) E5-2650 v4 (6 Cores) 10GB DDR4 240GB SSD 1Gbps from $ 20 or how to share the server? (Options are available with RAID1 and RAID10, up to 24 cores and up to 40GB DDR4).

    VPS (KVM) E5-2650 v4 (6 Cores) 10GB DDR4 240GB SSD 1Gbps until December for free if you pay for a period of six months, you can order here .

    Dell R730xd 2 times cheaper? Only we have 2 x Intel Dodeca-Core Xeon E5-2650v4 128GB DDR4 6x480GB SSD 1Gbps 100 TV from $ 249in the Netherlands and the USA! Read about How to build an infrastructure building. class c using servers Dell R730xd E5-2650 v4 worth 9000 euros for a penny?

    Also popular now: