Multilingual Badoo: “Lost in Translation”



    Good localization, that is, adapting an application for users in different countries, helps it win over its audience. Bad localization, by contrast, becomes a real pain. For example, one navigation app on Google Play advises "Do not update, you did not purchase a commercial map" and alarms users with "On some devices you will be asked to select the installation folder."

    The purpose of localization is not merely to make the application available in other languages, but to make each user feel that it was developed with the peculiarities of their native language in mind.

    In this article, we briefly cover the aspects of localization that deserve attention first and share the experience we gained while translating Badoo into 46 languages. This is a very broad topic, and in future articles we will describe in detail how we implemented particular tools. At the end of the article, you can vote for the aspect you would like to read about first.

    Introduction


    Supporting multiple languages is a complex, multi-step task that begins with adapting your application code. Almost any text shown to the user (unless it is a purely technical element) may need to be modified for some languages.

    There are many solutions that let you separate translatable text from everything else and organize a translation system without fatal flaws. We do not use a ready-made solution: we decided to build and develop our own system, make all the mistakes ourselves, and reinvent the wheel. In return, our system turned out to be truly flexible and suits us in every respect. Let's start with terminology and the general principles of operation.

    The key element of the translation system is a piece of text that is compact enough to be convenient to work with, yet large enough to remain logically complete. We call such fragments tokens. As an example, consider the Badoo messenger. It is a good example because interfaces like it are common in both mobile and web applications.


    Note several key points that are clearly visible in this screenshot. There are different kinds of tokens:
    • those that can be used repeatedly ("Search", "Unread", "Anonymous Chat");
    • those containing variables (“View profile and 16 photos”);
    • those that depend on gender ("Take the first step, write to him!");
    • those that depend on numeric parameters and involve declension ("2370 girls will see you here").

    At Badoo, frequently used tokens such as "Search", "Unread", and "Girl" are kept separate from the rest and can be reused across the subsystems of our large and varied architecture, with a single translation shared by the mobile and web applications. The key benefits of this approach:
    • less work for translators;
    • a uniform text style;
    • the possibility of additional processing of tokens (varying them by number and case).

    With tokens containing variables ("View profile and {{number}} photos"), everything is simple: you just need to remember to fill in the data.
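
    As a rough illustration, filling in such a token can be sketched as follows. This is a minimal sketch and not Badoo's implementation: the render_token function and its error handling are assumptions, and only the {{number}} placeholder syntax comes from the example above.

```python
import re

# Placeholder syntax modeled on the "{{number}}" example in the text.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_token(template, variables):
    """Fill every {{name}} placeholder; fail loudly if a value is missing."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # "Remembering to fill in the data" is enforced here.
            raise KeyError(f"no value supplied for placeholder '{name}'")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)


print(render_token("View profile and {{number}} photos", {"number": 16}))
```

    Raising an error on a missing variable is one possible design choice; a production system might prefer to log the problem and fall back to a neutral phrase.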

    Dependence on number and declension (“{{number}} girls will see you here”) is much more complicated; we discuss it in a separate section.

    The process of preparing and displaying translations can be a serious headache in terms of system performance, especially if you have to do this more than 20 thousand times per second (the peak load in Badoo can be higher).

    Now let's look in more detail at what you should pay attention to.

    Dialects and multi-level “failover”


    Some languages have dialects. For example, English has British and American variants, and Spanish has Colombian, Argentinean, and Mexican ones. Even if the translations are 99% identical, the same phrase may need to be worded completely differently in each of them. Ignoring this small nuance can cause serious embarrassment. For example, rapariga means “girl” in Portugal, but in Brazil the word is used to mean a prostitute. Brazilian Portuguese uses garota instead, which does not work in Portugal because there it means "little girl."

    At Badoo, we arrange languages in a tree. The root element is “universal English”. Other languages (including British and American English) branch out from it, and some of them, in turn, have dialects.

    Translators work from the top down: first comes universal English, then the second-level languages, and only then their dialects. That is, Spanish is translated from universal English, and Mexican Spanish from Spanish.

    When a translation is displayed to the user, the search goes from the bottom up. For example, for Mexican Spanish we first look for a Mexican translation. If there is none, we look for a Spanish one, and if that is missing too, we fall back to universal English.
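
    The bottom-up search can be sketched roughly as follows. The language codes, the PARENT and TRANSLATIONS tables, and the sample strings are hypothetical and serve only to illustrate the fallback chain; they do not describe how Badoo actually stores translations.

```python
# Hypothetical language tree: each language points to its parent.
PARENT = {
    "es-MX": "es",  # Mexican Spanish falls back to Spanish
    "es": "en",     # Spanish falls back to "universal English"
    "en-GB": "en",
    "en-US": "en",
    "en": None,     # the root of the tree
}

# Illustrative data: a dialect stores only what differs from its parent.
TRANSLATIONS = {
    "en": {"search": "Search", "unread": "Unread", "computer": "Computer"},
    "es": {"search": "Buscar", "computer": "Ordenador"},
    "es-MX": {"computer": "Computadora"},
}


def lookup(token_id, language):
    """Walk from the requested language up to the root until a text is found."""
    current = language
    while current is not None:
        text = TRANSLATIONS.get(current, {}).get(token_id)
        if text is not None:
            return text
        current = PARENT.get(current)
    raise KeyError(f"token '{token_id}' has no translation, not even in the root")


print(lookup("computer", "es-MX"))  # found in the dialect itself
print(lookup("search", "es-MX"))    # inherited from Spanish
print(lookup("unread", "es-MX"))    # falls back to universal English
```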

    Direction of writing and punctuation


    For most languages, translating the text is enough; the appearance of the application and the elements around the text stay unchanged. However, some languages need more:
    • languages written from right to left (for example, Arabic and Hebrew);
    • languages with special punctuation rules (Spanish, Japanese).

    For right-to-left languages, you must not only translate the text but also mirror the interface: it is not just the direction of the text that changes, but also the direction in which information is perceived.


    With punctuation, some cases are simpler. For example, Asian languages (Japanese, Korean) use their own Unicode characters for periods, question marks, and exclamation marks (they look almost like ours, but are different characters):
    。?!
    .?!

    And there are more complicated cases. For example, in Spanish, a question or exclamation also opens with an inverted mark (¿ or ¡) at the beginning of the sentence.


    Under no circumstances should punctuation be left out of tokens!

    Formats and Units


    There are subtle but very important differences in the formatting of dates and numbers that can give them completely different meanings in different countries.
    For example, 03/07/2013 may mean July 3 or March 7, depending on local conventions. This is a common cause of confusion between the US and the UK, which share a language but use different date formats. Do not assume that two countries that speak the same language will necessarily read everything the same way.
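
    The ambiguity is easy to reproduce. In the minimal sketch below, the DATE_FORMATS table and the locale codes are assumptions chosen for illustration; a real application would normally take such patterns from a locale database rather than hard-code them.

```python
from datetime import datetime

# Hypothetical per-locale date patterns (strptime syntax).
DATE_FORMATS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
}


def parse_local_date(text, locale_code):
    """Interpret a numeric date according to the conventions of a locale."""
    return datetime.strptime(text, DATE_FORMATS[locale_code]).date()


us = parse_local_date("03/07/2013", "en-US")
gb = parse_local_date("03/07/2013", "en-GB")
print(us.isoformat())  # 2013-03-07, i.e. March 7
print(gb.isoformat())  # 2013-07-03, i.e. July 3
```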

    The same thing happens with numbers. The number 1.000 can be read as “one” or as “one thousand”, depending on which character separates the fractional part. For example, in Korea the dot is the decimal separator, while in Germany it separates thousands.
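
    A small sketch shows how the same string yields different values. The SEPARATORS table, the locale codes, and both helper functions are hypothetical; in production these conventions would usually come from a locale library.

```python
# Hypothetical separator table for the two countries mentioned above.
SEPARATORS = {
    "ko-KR": {"decimal": ".", "thousands": ","},
    "de-DE": {"decimal": ",", "thousands": "."},
}


def parse_number(text, locale_code):
    """Read a localized number string as a float."""
    sep = SEPARATORS[locale_code]
    normalized = text.replace(sep["thousands"], "").replace(sep["decimal"], ".")
    return float(normalized)


def format_number(value, locale_code, digits=0):
    """Write a number using the separators of the given locale."""
    sep = SEPARATORS[locale_code]
    text = f"{value:,.{digits}f}"  # Python's default: "," thousands, "." decimal
    # Swap through a temporary marker so the two replacements do not collide.
    return (
        text.replace(",", "\x00")
        .replace(".", sep["decimal"])
        .replace("\x00", sep["thousands"])
    )


print(parse_number("1.000", "ko-KR"))       # 1.0
print(parse_number("1.000", "de-DE"))       # 1000.0
print(format_number(1234.5, "de-DE", 2))    # 1.234,50
print(format_number(1234.5, "ko-KR", 2))    # 1,234.50
```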

    Pay special attention to the measurement system. The simplest solution is to display the user's height in feet and centimeters at the same time, but that looks unnatural. Instead, you can add a switch that lets the user choose the units they prefer and set the default based on the selected language. This applies to length (height), weight, temperature, and so on.
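
    Such a switch with a language-based default might look roughly like this. The DEFAULT_UNITS table, the locale codes, and the format_height function are assumptions made for the example.

```python
# Hypothetical defaults: locales not listed here are treated as metric.
DEFAULT_UNITS = {"en-US": "imperial"}


def format_height(height_cm, locale_code, user_choice=None):
    """Show height in the user's chosen system, or in the locale default."""
    system = user_choice or DEFAULT_UNITS.get(locale_code, "metric")
    if system == "imperial":
        total_inches = round(height_cm / 2.54)
        feet, inches = divmod(total_inches, 12)
        return f"{feet} ft {inches} in"
    return f"{height_cm} cm"


print(format_height(180, "en-US"))            # 5 ft 11 in
print(format_height(180, "de-DE"))            # 180 cm
print(format_height(180, "en-US", "metric"))  # the user's choice wins: 180 cm
```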

    Stylistics


    Different Badoo components can use different styles of text: more formal in some places, more youthful in others. For example, in the terms of service and other official documents it is better to translate “you” with the formal form of address (the Russian «Вы»), while entertainment interfaces often use the informal one («ты»).

    In addition, it is very important to keep terminology consistent and translate established words and phrases the same way everywhere. For example, the random dating service on Badoo is called Encounters in English. The word can be translated in different ways, but we stick to a single translation, "Dating". This is extremely important: otherwise the user may not understand a promotional text that calls for some action, or an error message. We use two mechanisms to solve this problem. The first is a separate group of short tokens that are either used very often or may depend on gender and number. We discuss this group further below.

    The second mechanism we call TranslationMemory. It performs two functions at once:
    • reduces the amount of work for translators (and, as a result, speeds up deployment);
    • helps keep translations of similar tokens in a consistent style.

    The logic of TranslationMemory is quite simple, but the implementation is an interesting topic in its own right, and we will certainly tell you more about it in the future. In short, when a token is translated, we split both the original text and its translation into smaller “threads” (parts of phrases and whole sentences) at punctuation, tags, line breaks, and some other delimiters. Why threads? Because they can intersect, intertwine, and contain any number of other threads. We call the set of all threads in a token the structure of the token.

    If the thread structures of the original and the translation can be matched unambiguously, we save the pairs of threads. Later, when new tokens appear in the translation system, we try to find a translation for each thread. By combining the matches found, we pick the most complete translation. The translator can choose one of several most complete translations, assembled piece by piece from different threads, as the basis for the new translation.
    For example, having once translated two separate tokens, "Hello world" and "My name is John", the translator has almost nothing left to do for the token "Hello world! My name is John.": TranslationMemory will offer a ready translation. The translator only has to make sure that the punctuation suits the language.
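
    The idea can be sketched in a deliberately simplified form. The class below is hypothetical: it splits text only at sentence punctuation and line breaks and uses flat, non-overlapping segments, whereas the threads described above can nest and intersect. The Spanish sample translations are illustrative.

```python
import re

# Split at sentence punctuation and line breaks, keeping the delimiters.
SPLIT = re.compile(r"([.!?\n]+)")


def split_segments(text):
    """Return the non-empty text pieces between delimiters."""
    pieces = SPLIT.split(text)[0::2]  # even positions hold text, odd hold delimiters
    return [piece.strip() for piece in pieces if piece.strip()]


class TranslationMemory:
    def __init__(self):
        self._pairs = {}

    def remember(self, source, translation):
        """Store segment pairs only when both sides have the same structure."""
        src, dst = split_segments(source), split_segments(translation)
        if len(src) == len(dst):
            self._pairs.update(zip(src, dst))

    def suggest(self, source):
        """Return (draft, known_segments, total_segments) for a new token."""
        out, known, total = [], 0, 0
        for index, part in enumerate(SPLIT.split(source)):
            key = part.strip()
            if index % 2 == 1 or not key:  # delimiter or empty piece
                out.append(part)
                continue
            total += 1
            if key in self._pairs:
                known += 1
                part = part.replace(key, self._pairs[key])
            out.append(part)
        return "".join(out), known, total


tm = TranslationMemory()
tm.remember("Hello world", "Hola mundo")
tm.remember("My name is John", "Me llamo John")
print(tm.suggest("Hello world! My name is John."))
```

    The draft keeps the source punctuation, so for Spanish the translator would still need to add the opening "¡" by hand, which matches the final check described above.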

    Gender dependence


    Different languages mark gender differently: some use articles and prepositions, some use endings, and some use everything at once. For example, in Slavic languages almost every part of speech can depend on gender. In addition, complex phrases may depend not only on the gender of the object but also on the gender of the subject. The rules in some languages are so complicated that sometimes you have to duplicate the English text for several combinations of objects and subjects of different genders and modify the application accordingly.

    Such situations are almost impossible to foresee unless you are a polyglot. Moreover, we believe developers should not have to think about them. Therefore, the translation interface gives our translators a special tool for “ordering” a gender split of a token: a development ticket describing the problem is created automatically.
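
    Once a token has been split by gender, choosing the right variant could look roughly like this. The data layout (variants keyed by a pair of subject and object gender, with None as a wildcard) and the pick_variant function are assumptions for illustration only.

```python
# Hypothetical variants of one token, keyed by (subject_gender, object_gender).
# None means "any": the English text only depends on the gender of the object.
FIRST_STEP = {
    (None, None): "Take the first step, write to them!",
    (None, "male"): "Take the first step, write to him!",
    (None, "female"): "Take the first step, write to her!",
}


def pick_variant(variants, subject=None, obj=None):
    """Try the most specific combination first, then progressively looser ones."""
    for key in ((subject, obj), (subject, None), (None, obj), (None, None)):
        if key in variants:
            return variants[key]
    raise KeyError("token has no default variant")


print(pick_variant(FIRST_STEP, subject="female", obj="male"))
print(pick_variant(FIRST_STEP, subject="male"))
```

    A language that also depends on the gender of the subject would simply add keys such as ("female", "male"), without any change to the lookup code.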

    Dependence on number and declension


    In most languages, there are only two number forms: singular and plural. Russian is an excellent example of more complex rules: 1 пользователь, 2 пользователя, 5 пользователей (“user” in three different forms). Moreover, it is 21 (31, 41, 101) пользователь, but 11 пользователей. The rules themselves are not very complicated, but let's dig deeper.
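
    The Russian rule for whole numbers fits in a few lines. In this sketch, the category names "one", "few", and "many" follow the naming used by Unicode CLDR, and the function and table names are our own illustration rather than part of the Badoo system.

```python
def russian_plural_category(n):
    """Pick the plural category of a whole number according to Russian rules."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return "one"   # 1, 21, 31, 101, but not 11
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"   # 2-4, 22-24, but not 12-14
    return "many"      # 0, 5-20, 25-30, ...


# The three forms of the Russian word for "user".
USER_FORMS = {"one": "пользователь", "few": "пользователя", "many": "пользователей"}

for count in (1, 2, 5, 11, 21, 101):
    print(count, USER_FORMS[russian_plural_category(count)])
```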

    Typically, applications count whatever matters to them. Social networks count users, photos, posts, and likes. In the financial sector, transactions, currency, and customers are counted. GPS navigators count minutes and kilometers (or miles). The things being counted are those whose names and units of measurement appear throughout the application. These are the frequently used tokens that have been mentioned several times in this article. Dependence on number is one of the reasons we created a separate tool for handling such tokens.

    The second reason is declension, or, as the Russian mnemonic for the grammatical cases goes, “Ivan gave birth to a girl and ordered the diaper to be dragged along”. An interesting fact: Hungarian, with its 17 cases, holds the record among the languages into which we translate the site and applications. For rare words and phrases, a plain text translation without any software support is enough. For frequently occurring words and phrases, it is useful to have a tool that returns the grammatically correct form. For example, the phrase “2 girls liked you” warms the heart not only with the pleasant prospect of a new acquaintance, but also with clear and correct Russian.

    Our tools support two important operations. Developers can get the finished word or phrase in the grammatically correct form (more precisely, a universal container). Translators can use these correct forms in the translations of ordinary tokens. For example, in the translation system the Russian version of the token above looks roughly like "You were liked by {{users_num}} {{users_word # Dative}}" (rendered here in English). This gives us a certain freedom: the translator may rephrase the token and change the case at their discretion.
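
    One way to picture such a container is sketched below. Everything here is an assumption made for illustration: the WordContainer class, the table of forms, and the Russian wording of the template are our reconstruction, and only the placeholder names and the "# Dative" suffix come from the example above.

```python
import re


def plural_category(n):
    """Russian plural category of a whole number (CLDR-style names)."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


# Forms of the Russian word for "girl", by case and plural category.
GIRL_FORMS = {
    "Nominative": {"one": "девушка", "few": "девушки", "many": "девушек"},
    "Dative": {"one": "девушке", "few": "девушкам", "many": "девушкам"},
}


class WordContainer:
    """Hypothetical container: a noun bound to a count; the case is chosen later."""

    def __init__(self, forms, count):
        self.forms = forms
        self.count = count

    def in_case(self, case):
        return self.forms[case][plural_category(self.count)]


# Matches {{name}} and {{name # Case}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*(?:#\s*(\w+)\s*)?\}\}")


def render(template, variables):
    def substitute(match):
        value = variables[match.group(1)]
        if isinstance(value, WordContainer):
            return value.in_case(match.group(2) or "Nominative")
        return str(value)

    return PLACEHOLDER.sub(substitute, template)


template = "Вы понравились {{users_num}} {{users_word # Dative}}"
for count in (2, 21):
    data = {"users_num": count, "users_word": WordContainer(GIRL_FORMS, count)}
    print(render(template, data))
```

    The developer only supplies the count and the container; the case is chosen inside the template, so a translator can rephrase the sentence and switch to another case without any code change.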

    This is a fairly good solution, but it requires interaction between translators and developers. We are now working on a system that will let translators change all or part of a token based on the variables available in it, without involving developers.

    Token Context and Length


    Often the same phrase (not to mention individual words) can be translated in different ways depending on the context. A short token such as “Search” can be either a noun or a verb. In the pursuit of reusing identical tokens and translations, it is important to keep an eye on the context. To help translators understand the context of a phrase correctly, we usually attach a screenshot showing the token in use. We even created a system that automatically collects screenshots while a task is being tested, but more on that in a separate article.

    When working on mobile projects, pay special attention to string length. Screen space is scarce, and you need to make sure that the text fits in the space allotted to it. Often a term that is a single word in English becomes a whole sentence in other languages. String length can be limited in characters or in pixels (if the font size and typeface are known in advance and rarely change).

    The restriction on the length of a translation is usually advisory in nature. If the limit is exceeded, the translator will see a warning, but can still save the translation.
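
    An advisory limit of this kind can be sketched as follows. The SaveResult type, the save_translation function, and the in-memory store are hypothetical; the sketch covers only the character-based limit, since a pixel-based one would need font metrics.

```python
from dataclasses import dataclass, field


@dataclass
class SaveResult:
    saved: bool
    warnings: list = field(default_factory=list)


def save_translation(store, token_id, text, max_chars=None):
    """Save a translation; an exceeded limit produces a warning, not an error."""
    warnings = []
    if max_chars is not None and len(text) > max_chars:
        warnings.append(
            f"'{token_id}' is {len(text)} characters long; "
            f"the recommended limit is {max_chars}"
        )
    store[token_id] = text  # the translation is saved either way
    return SaveResult(saved=True, warnings=warnings)


store = {}
print(save_translation(store, "search_button", "Suchen", max_chars=10))
print(save_translation(store, "view_profile", "Benutzerprofil anzeigen", max_chars=10))
```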

    Multi-versioning and Fault Tolerance


    When your team has more than a hundred developers, working with translations requires some care: the same template with translatable text (or the dictionary of a mobile application) can be changed in several tasks at once. The translation system must be able to distinguish between different versions of files and understand which translation should be served to the user.

    For a large team, it is also important to make the translation system as convenient and fault-tolerant as possible. Convenience lets new team members get to work as quickly as possible. Fault tolerance reduces the influence of the human factor: the system must cope with human errors on its own, either correcting them where possible or failing loudly.

    Let users translate


    You can spend a long and painful time looking for staff or freelance translators, devising a quality control system for translations, and suffering every time you want to add support for a new language. But if your application is entertainment-oriented and the audience is large enough, it is perfectly acceptable to involve users in translation. This is how Facebook and WhatsApp are translated and, more recently, Badoo as well.
    We attach great importance to the quality of translations, and we were nervous about launching such a scheme. However, this approach has several strengths:
    • you do not need to hire staff translators for every language;
    • native speakers themselves control the quality of translations;
    • it's free.

    We reward the most active participants, but for the most part users work for the idea of making Badoo available in their native language. Currently, users are working on seven languages, three of which (Finnish, Malay, and Vietnamese) are already available to the entire Badoo community. The translations into the remaining four (Basque, Bengali, Icelandic, and Swahili) are not yet good enough to be offered to all users, but it is only a matter of time.

    Conclusion


    The purpose of localization is to make users feel comfortable in your application regardless of their language and place of residence. This often requires non-obvious and complex solutions, but based on our seven years of experience we can safely say that it is worth it.
    The translation system at Badoo has been built over all these years and continues to evolve. In future articles, we will try to describe our technical and organizational solutions in more detail. Which article comes next is up to you!

    Gleb Deikalo, PHP developer


    What would you most like to read about in upcoming articles?

    • 39.6% (119 votes): Development and translation workflow. What the process looks like from the product manager's point of view: Git, Jira, hooks, and automation.
    • 59.6% (179 votes): The core of the system: parsing files, supporting multiple versions, generating translated content.
    • 37.3% (112 votes): Dependence on number and declension: the ability to rephrase.
    • 35.6% (107 votes): One style and less work: TranslationMemory.
    • 30.3% (91 votes): Translations by users: reducing the translation staff and introducing new languages for free.
