Determining the language of messages simply and accurately


Our company, YouScan, processes about 100 million messages per day, applying many rules and various smart features to them. For these to work correctly, we need to detect the language of each message, because not all features can be made language-agnostic. In this article, we briefly describe our study of this problem and evaluate detection quality on a dataset of social network posts.


Article outline


  1. Problems of language detection
  2. Available public solutions
    • Compact Language Detector 2
    • FastText
  3. Quality assessment
  4. Conclusions

1. Problems of language detection


Language detection is a rather old problem, and many companies have had to solve it to make their products multilingual. Older approaches rely on n-grams: the occurrences of each n-gram are counted and, based on those counts, a likelihood score is computed for each language, after which the most probable language according to the model is chosen. The main drawback of these models is that context is not taken into account at all, so telling apart closely related languages is difficult. But thanks to their simplicity, such models are very fast, which saves resources in high-load systems. A more modern option is a solution based on recurrent neural networks, which relies not only on n-grams but also takes context into account, and should therefore give a gain in quality.
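To make the n-gram idea concrete, here is a minimal, purely illustrative sketch of such a detector. The tiny profiles and the add-one smoothing are invented for the example; real detectors such as cld2 learn far larger profiles from big corpora.

A toy n-gram language detector
import math
from collections import Counter

def char_ngrams(text, n=3):
    # Split text into overlapping character n-grams, padded at the edges
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Toy n-gram frequency profiles; a real detector learns these from large corpora
profiles = {
    "en": Counter(char_ngrams("the quick brown fox jumps over the lazy dog")),
    "fr": Counter(char_ngrams("le renard brun saute par-dessus le chien paresseux")),
}

def detect(text):
    scores = {}
    for lang, profile in profiles.items():
        total = sum(profile.values())
        # Log-likelihood with add-one smoothing so unseen n-grams don't zero out a language
        scores[lang] = sum(
            math.log((profile[g] + 1) / (total + len(profile)))
            for g in char_ngrams(text)
        )
    return max(scores, key=scores.get)  # most probable language under the model

print(detect("le chien"))  # fr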


The difficulty of building your own solution lies in collecting training data and in the training process itself. The most obvious option is to train the model on Wikipedia articles, since the language is known exactly and the texts are high-quality, verified, and relatively easy to collect. But to train your own model you would need to spend a lot of time building datasets, processing them, and then choosing the best architecture. Most likely, someone has already done this before us. In the next section, we look at existing solutions.


2. Available public solutions


Compact Language Detector 2


CLD2 is a probabilistic model based on machine learning (a Naive Bayes classifier) that can detect 83 different languages in UTF-8 or HTML/XML text. For mixed-language texts, the model returns the top 3 languages, where the probability is computed as an approximate percentage of the text's total bytes. If the model is not confident in its answer, it returns the "un" (unknown) tag.


The precision and recall of this model are quite good, but its main advantage is speed. The authors claim about 30 KB of text per 1 ms; in our tests of the Python wrapper we measured 21 to 26 KB per 1 ms (70,000-85,000 messages per second, with an average message size of 0.8 KB and a median of 0.3 KB).
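If you want to verify the throughput on your own data, a minimal benchmark could look like the sketch below. The sample messages are placeholders for your own corpus, and the exact numbers will of course depend on your hardware.

Measuring cld2 throughput
import time
import pycld2 as cld2

# Placeholder corpus; substitute your own messages to reproduce the numbers
messages = ["Bonjour, Habr!", "Привет, Хабр!", "Hello, Habr!"] * 30000

start = time.perf_counter()
for message in messages:
    cld2.detect(message)
elapsed = time.perf_counter() - start
print(f"{len(messages) / elapsed:.0f} messages per second")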


This solution is very easy to use. First you need to install its Python wrapper or use our Docker image.


To make a prediction, simply import the pycld2 library and write one more line of code:


Language detection with cld2
import pycld2 as cld2
cld2.detect("Bonjour, Habr!")
# (True,
#  14,
#  (('FRENCH', 'fr', 92, 1102.0),
#   ('Unknown', 'un', 0, 0.0),
#   ('Unknown', 'un', 0, 0.0)))

The detector response is a tuple of three elements:


  • whether the language was detected reliably;
  • the number of bytes of text analyzed;
  • a tuple of the three most probable languages, where the first element is the full name,
    the second is the ISO 639-1 language code, the third is the percentage of characters attributed to that language, and the fourth is a normalized score (see the snippet below for how to unpack this).
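For example, to keep only confident predictions, you can unpack the tuple as in the following sketch; the 90% cutoff is our own arbitrary choice for the example.

Filtering unreliable cld2 answers
import pycld2 as cld2

def detect_language(text, min_percent=90):
    # pycld2 returns (is_reliable, bytes_found, details)
    is_reliable, _, details = cld2.detect(text)
    name, code, percent, score = details[0]  # most probable language first
    if is_reliable and percent >= min_percent:
        return code
    return None  # treat unreliable answers as "unknown"

print(detect_language("Bonjour, Habr!"))  # 'fr'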

FastText


FastText is a Facebook library for efficient text representation learning and classification. As part of this project, Facebook Research released embeddings for 157 languages that show state-of-the-art results on various tasks, as well as a model for language identification and other supervised tasks.


For the language identification model, they used data from Wikipedia, Tatoeba, and SETimes, with their own fastText solution as the classifier.


Facebook Research provides two models:


  • lid.176.bin, which is slightly faster and more accurate than the second model, but weighs 128 MB;
  • lid.176.ftz, a compressed version of the original model.

To use these models in Python, you must first install the Python wrapper for fastText. It can be tricky to install, so carefully follow the instructions on GitHub or use our Docker image. You also need to download the model from the link above. In this article, we use the original version.


Classifying a language with the Facebook model is a bit more involved; it takes three lines of code:


Language detection with the fastText model
from pyfasttext import FastText
model = FastText('../model/lid.176.bin')
model.predict_proba(["Bonjour, Habr!"], 3)
# [[('fr', 0.7602248429835308),
#   ('en', 0.05550386696556002),
#   ('ca', 0.04721488914800802)]]

The fastText model can predict probabilities for the top n languages, where n = 1 by default; in this example we output the top 3. Unlike the cld2 model, which reports the share of characters belonging to each language, here the value is the overall probability that the text is in a given language. The speed is also quite high: more than 60,000 messages per second.


3. Quality assessment


We will evaluate the quality of the algorithms on social network data for a random time period, taken from the YouScan system (approximately 500 thousand mentions). The sample therefore contains mostly Russian and English (43% and 32%, respectively); Ukrainian, Spanish, and Portuguese account for about 2% each, and every other language for less than 1%. As the ground truth we take labels from Google Translate, since Google is currently very good not only at translating texts but also at identifying their language. Its labels are not perfect, of course, but in most cases they can be trusted.


Our metrics for language detection quality are precision, recall, and F1 score. Let's compute them and display them in a table:


Comparison of the quality of the two algorithms
import pandas as pd

# Parse the file: each record is "text ||| cld2_label ||| ft_label ||| google_label |end"
with open("../data/lang_data.txt", "r") as f:
    text_l, cld2_l, ft_l, g_l = [], [], [], []
    s = ''
    for i in f:
        s += i
        if ' |end\n' in s:
            text, cld2, ft, g = s.strip().rsplit(" ||| ", 3)
            text_l.append(text)
            cld2_l.append(cld2)
            ft_l.append(ft)
            g_l.append(g.replace(" |end", ""))
            s = ''

data = pd.DataFrame({"text": text_l, "cld2": cld2_l, "ft": ft_l, "google": g_l})

def lang_summary(lang, col):
    # Precision: share of correct answers among messages the model labeled as `lang`
    prec = (data.loc[data[col] == lang, "google"] == data.loc[data[col] == lang, col]).mean()
    # Recall: share of messages Google labeled as `lang` that the model also found
    rec = (data.loc[data["google"] == lang, "google"] == data.loc[data["google"] == lang, col]).mean()
    return round(prec, 3), round(rec, 3), round(2 * prec * rec / (prec + rec), 3)

results = {}
for approach in ["cld2", "ft"]:
    results[approach] = {}
    for l in data["google"].value_counts().index[:20]:  # 20 most frequent languages
        results[approach][l] = lang_summary(l, approach)

res = pd.DataFrame.from_dict(results)
# Split the (precision, recall, f1) tuples into separate columns
res["cld2_prec"], res["cld2_rec"], res["cld2_f1"] = zip(*res["cld2"])
res["ft_prec"], res["ft_rec"], res["ft_f1"] = zip(*res["ft"])
res.drop(columns=["cld2", "ft"], inplace=True)
arrays = [['cld2', 'cld2', 'cld2', 'ft', 'ft', 'ft'],
          ['precision', 'recall', 'f1_score', 'precision', 'recall', 'f1_score']]
tuples = list(zip(*arrays))
res.columns = pd.MultiIndex.from_tuples(tuples, names=["approach", "metrics"])

lang | cld2 prec | cld2 rec | cld2 f1 | ft prec | ft rec | ft f1 | ans prec | ans rec | ans f1
ar   | 0.992 | 0.725 | 0.838 | 0.918 | 0.697 | 0.793 | 0.968 | 0.788 | 0.869
az   | 0.950 | 0.752 | 0.839 | 0.888 | 0.547 | 0.677 | 0.914 | 0.787 | 0.845
bg   | 0.529 | 0.136 | 0.217 | 0.286 | 0.178 | 0.219 | 0.408 | 0.214 | 0.281
en   | 0.949 | 0.844 | 0.894 | 0.885 | 0.869 | 0.877 | 0.912 | 0.925 | 0.918
es   | 0.987 | 0.653 | 0.786 | 0.709 | 0.814 | 0.758 | 0.828 | 0.834 | 0.831
fr   | 0.991 | 0.713 | 0.829 | 0.530 | 0.803 | 0.638 | 0.713 | 0.810 | 0.758
id   | 0.763 | 0.543 | 0.634 | 0.481 | 0.404 | 0.439 | 0.659 | 0.603 | 0.630
it   | 0.975 | 0.466 | 0.631 | 0.519 | 0.778 | 0.622 | 0.666 | 0.752 | 0.706
ja   | 0.994 | 0.899 | 0.944 | 0.602 | 0.842 | 0.702 | 0.847 | 0.905 | 0.875
ka   | 0.962 | 0.995 | 0.979 | 0.959 | 0.905 | 0.931 | 0.958 | 0.995 | 0.976
kk   | 0.908 | 0.653 | 0.759 | 0.804 | 0.584 | 0.677 | 0.831 | 0.713 | 0.767
ko   | 0.984 | 0.886 | 0.933 | 0.940 | 0.704 | 0.805 | 0.966 | 0.910 | 0.937
ms   | 0.801 | 0.578 | 0.672 | 0.369 | 0.101 | 0.159 | 0.730 | 0.586 | 0.650
pt   | 0.968 | 0.753 | 0.847 | 0.805 | 0.771 | 0.788 | 0.867 | 0.864 | 0.865
ru   | 0.987 | 0.809 | 0.889 | 0.936 | 0.933 | 0.935 | 0.953 | 0.948 | 0.950
sr   | 0.093 | 0.114 | 0.103 | 0.174 | 0.103 | 0.130 | 0.106 | 0.160 | 0.128
th   | 0.989 | 0.986 | 0.987 | 0.973 | 0.927 | 0.950 | 0.979 | 0.986 | 0.983
tr   | 0.961 | 0.639 | 0.768 | 0.607 | 0.730 | 0.663 | 0.769 | 0.764 | 0.767
uk   | 0.949 | 0.671 | 0.786 | 0.615 | 0.733 | 0.669 | 0.774 | 0.777 | 0.775
uz   | 0.666 | 0.512 | 0.579 | 0.770 | 0.169 | 0.278 | 0.655 | 0.541 | 0.592

The results clearly show that the cld2 approach has very high precision: it drops below 90% only for less common languages, and in 90% of cases it beats fastText. With roughly equal recall for the two approaches, cld2 also has the higher F1 score.
The peculiarity of the cld2 model is that it makes a prediction only for messages it is sufficiently confident about, which explains the high precision. The fastText model returns an answer for almost every message, so its precision is much lower; surprisingly, though, its recall is not noticeably higher, and in half of the cases it is even lower. However, by tuning the confidence threshold of the fastText model, you can improve its precision, as shown in the sketch below.
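For illustration, a thresholded wrapper around fastText might look like this; the 0.9 cutoff is an arbitrary value for the example, not a tuned one.

Thresholding fastText predictions
from pyfasttext import FastText

model = FastText('../model/lid.176.bin')

def detect_with_threshold(text, threshold=0.9):
    # predict_proba returns [[(language, probability), ...]] sorted by probability
    predictions = model.predict_proba([text], 1)[0]
    if not predictions:
        return None
    lang, prob = predictions[0]
    # Discard low-confidence answers instead of returning a guess
    return lang if prob >= threshold else None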


4. Conclusions


In general, both models give good results and can be used to solve the language detection problem in different domains. Their main advantage is high speed, which makes it possible to build a so-called "ensemble" of them and add whatever preprocessing is needed to improve quality.
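As an illustration of what such an ensemble could look like, here is a sketch of one possible combination: trust cld2 when it is confident, and fall back to fastText otherwise. This is our own example, not necessarily the exact scheme behind the "ans" column above.

A simple cld2 + fastText ensemble
import pycld2 as cld2
from pyfasttext import FastText

ft_model = FastText('../model/lid.176.bin')

def ensemble_detect(text):
    # cld2 is precise but often refuses to answer ("un" = unknown)
    is_reliable, _, details = cld2.detect(text)
    lang_code = details[0][1]
    if is_reliable and lang_code != "un":
        return lang_code
    # Fall back to fastText, which answers for almost every message
    return ft_model.predict_proba([text], 1)[0][0][0]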


All the code for reproducing the experiments and testing the approaches described above can be found in our repository.


You can also look at a test of these solutions in another article, which compares their accuracy and speed on six Western European languages.

