Working with foreign texts: how to increase your percentage of understanding and learn a language?

In life or at work you sometimes have to deal with texts in a foreign language that you know far from perfectly. To read a text and understand what it is about (and, ideally, to learn a few new words), I usually used one of two options. The first is translating the whole text in the browser; the second is translating each word individually with, for example, ABBYY Lingvo. Both methods have many disadvantages. First, the browser translates whole sentences, which means it can change the word order, and the translation may come out even less comprehensible than the original text. Second, the browser offers neither alternative translations nor synonyms for words, which makes learning new words problematic. Alternatives and synonyms can be obtained by looking up a specific word in a translator, but that takes time, especially when there are many such words. Finally, while reading a text I would like to know which words occur most often in the language, so that I can memorize them and later use them in my own writing and speech.

I thought it would be nice to have such a “translator” at hand, so I decided to implement one in Python. If you're interested, read on.

Word count


When writing the program, I followed this logic. First, rewrite the entire text in lowercase, strip punctuation and other extraneous symbols (.?! etc.), and count how many times each word appears in the text. Inspired by code from Google, I did this without the slightest difficulty, but I decided to store the results in a slightly different form, namely {1: [words with frequency 1], 2: [words with frequency 2], ...}. This is convenient if sorting is required, including within each frequency group, for example if we want the words to appear in the same order as in the text. In total I want a double sort: the most frequent words come first, and words with the same frequency are ordered according to the source text. This idea is reflected in the following code.

def word_count_dict(filename, dictList=de500):
    count = {}
    txt = re.sub('[,.!?":;()*]', '',
                 open(filename, 'r').read().lower())
    words = txt.split()
    for word in words:
        if word not in count:
            count[word] = 1
        else:
            count[word] += 1
    return {i: sorted([w for w in count
                       if count[w] == i and w not in dictList.values()],
                      key=lambda x: txt.index(x))
            for i in set(count.values())}
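As an aside, and not part of the original program: the counting step can also be done with the standard library's collections.Counter, after which the frequency grouping becomes a short loop. A minimal sketch (it relies on dicts preserving insertion order, i.e. Python 3.7+):

```python
from collections import Counter

def group_by_frequency(words):
    # Count occurrences, then invert the mapping: frequency -> words,
    # keeping words in their order of first appearance in the text.
    counts = Counter(words)
    groups = {}
    for word in dict.fromkeys(words):  # dict.fromkeys preserves order
        groups.setdefault(counts[word], []).append(word)
    return groups

print(group_by_frequency("das haus ist das haus".split()))
# {2: ['das', 'haus'], 1: ['ist']}
```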

Well, everything works as I wanted, but there is a suspicion that the top of the list will be occupied by auxiliary words (such as the) and words whose translation is obvious (such as you). We can get rid of them by creating a list of the most frequently used words and excluding every word on that list when building the dictionary. Why else is this convenient? Because once we have learned a word, we can add it to the list, and its translation will no longer be shown. We store this list in the dictList variable and forget about it for a while.
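To make the exclusion mechanism concrete, here is a toy sketch; de_top is a made-up miniature stand-in for the real list, in the same shape the code expects (a dict whose values are the words to skip):

```python
# Hypothetical miniature exclusion list (rank -> word), standing in
# for the real top-50/100/500 lists.
de_top = {1: 'der', 2: 'die', 3: 'und', 4: 'in'}

words = ['das', 'haus', 'und', 'der', 'garten']
print([w for w in words if w not in de_top.values()])
# ['das', 'haus', 'garten']

# Once 'das' has been learned, add it to the list: its translation
# will no longer be shown.
de_top[len(de_top) + 1] = 'das'
print([w for w in words if w not in de_top.values()])
# ['haus', 'garten']
```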

Translation of words


After spending a few minutes searching for a convenient online translator, I decided to check Google and Yandex in action. Since Google closed the Translate API exactly 3 years and 1 day ago, we will use the workaround proposed by WNeZRoS. In response to a request for a particular word, Google offers a translation, alternative translations, and their reverse translations (that is, synonyms). Using Yandex, as usual, requires a key, and the response contains not only the translation but also usage examples, and probably more. In both cases the answer is a list in json format: quite simple for Google, somewhat more involved for Yandex. For this reason, and also because Google knows more languages (and often more words), I decided to settle on it.

We will send requests with the help of the wonderful grab library and write the answers to an auxiliary text file (dict.txt). In it we will try to find the main translation, alternatives, and synonyms, and print them if they are present. Let's make the last two options switchable. The corresponding code looks as follows.

def tranlsate(word, key, lan1='de', lan2='ru', alt=True, syn=True):
    g = Grab(log_file = 'dict.txt')
    link = 'http://translate.google.ru/translate_a/t?client=x&text='\
           + word + '&sl=' + lan1 + '&tl=' + lan2
    g.go(link)
    data = json.load(open('dict.txt'))
    translation, noun, alternatives, synonims = 0, 0, 0, 0
    try:
        translation = data[u'sentences'][0][u'trans']
        noun = data[u'dict'][0][u'pos']
        alternatives = data['dict'][0]['terms']
        synonims = data['dict'][0]['entry'][0]['reverse_translation']
    except (KeyError, IndexError, TypeError):
        pass
    if lan1=='de' and noun==u'имя существительное':
        word = word.title()
    if translation:
        print ('['+str(key)+']', word, ': ', translation)
        if alt and alternatives:
            [print (i, end=', ') for i in alternatives]
            print ('\r')
        if syn and synonims:
            [print (i.encode('cp866', errors='replace'), end=', ')
                                     for i in synonims]
            print ('\n')

As you can see, the default is translation from German into Russian. The key variable corresponds to the frequency of the word in the text. We will pass it in from another function, which calls the translation for each word.
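To make the parsing step more transparent: the function below extracts the same four fields with dict.get instead of a bare try/except. The sample response here is made up; only the key structure mirrors what the article's code reads from Google's json.

```python
def extract(data):
    # Pull translation, part of speech, alternatives and synonyms out
    # of a response dict with the structure the article's code expects.
    translation = data.get('sentences', [{}])[0].get('trans')
    entry = (data.get('dict') or [{}])[0]
    pos = entry.get('pos')
    alternatives = entry.get('terms')
    synonyms = (entry.get('entry') or [{}])[0].get('reverse_translation')
    return translation, pos, alternatives, synonyms

# Invented sample mimicking the response shape for a German noun.
sample = {
    'sentences': [{'trans': 'дом'}],
    'dict': [{'pos': 'имя существительное',
              'terms': ['дом', 'здание'],
              'entry': [{'reverse_translation': ['Haus', 'Heim']}]}],
}
print(extract(sample))
```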

Calling the translation function


Everything is simple here: I want to get groups of words with their frequencies as a dictionary (the word_count_dict function) and find a translation for each word (the tranlsate function). I also want to show only the first n groups of the most frequent words.

def print_top(filename, n=100):
    mydict = word_count_dict(filename)
    mydict_keys = sorted(mydict, reverse=True)[0:n]
    [[tranlsate(word, key) for word in mydict[key]] for key in mydict_keys]


List of most used words


Well, the program is almost ready; it only remains to compile a list of the most frequently used words. Such lists are easy to find on the Internet, and I compiled lists of the 50, 100, and 500 most used German words and put them in a separate file so as not to clutter the code.
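For reference, a plausible shape for that dictDe.py module, consistent with the dictList.values() lookup in the code above; the entries here are illustrative, not the real frequency list:

```python
# dictDe.py -- frequency lists stored as dicts of rank -> word,
# so that `word in dictList.values()` works as an exclusion test.
de50 = {1: 'der', 2: 'die', 3: 'und', 4: 'in', 5: 'den'}

# used elsewhere via: from dictDe import *
print('und' in de50.values())   # True
print('haus' in de50.values())  # False
```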

If someone puts together a similar list for English or another language, I will be grateful if they share it so that I can add it to mine.

Preliminary results


By running the program, you can get the results in approximately the following form:

[word frequency] word: translation
alternative translations,
synonyms


Well, the code is written and the program works, but how convenient and effective is it? To try to answer this question, I took a couple of German texts for testing.

The first, an article from Deutsche Welle, deals with Deutsche Bank financing coal mining near Australia. The article contains 498 words, of which the 15 most frequent in the text (using the list of the 50 most common German words for exclusion) account for 16.87% of the entire text. Roughly speaking, this means that if a person does not know these words, then after reading the translations of just 6.67% of all the distinct words in the text, their level of understanding will increase by almost 17% (if understanding is measured only by the share of familiar words in the text). At first glance, pretty good.

The second, an article from Spiegel, discusses how the German DAX stock index reacted to Poroshenko's victory in the presidential election in Ukraine (yes, it rose). The article contains 252 words, of which the 8 most frequent (6.06% of the distinct words) similarly account for 11.9% of the text.
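As a sanity check on these figures: for the first article, 16.87% of 498 words is about 84 occurrences spread over the 15 top words. The per-word split below is invented; only the totals (498 words, 16.87%) come from the text.

```python
def coverage(occurrences, total_words):
    # Share of the running text covered by the listed word occurrences.
    return sum(occurrences) / total_words

# Illustrative split of ~84 occurrences over 15 words; the article
# gives only the aggregate figure.
top15 = [12, 9, 8, 7, 7, 6, 5, 5, 5, 4, 4, 4, 3, 3, 2]
print(round(coverage(top15, 498) * 100, 2))  # 16.87
```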

In addition, it should be noted that if the text to be translated is short enough that each word occurs only once (for example, an e-mail message), then the proposed translations follow in the same order as the words in the text, which is very convenient.

It sounds beautiful (es klingt schön), but these are very crude tests, since I have made too many assumptions. I think the only way to check whether this idea really makes working with foreign-language texts easier is to use the program regularly, which, unfortunately, is not very convenient: to translate a text you first have to copy it into a .txt file, assign the file name to the filename variable, and then run the print_top function.

What is missing?


Instead of a conclusion, I would like to reflect on what is missing at this stage, and how this could be improved.

First, as just mentioned, convenience. The code is inconvenient to use: you need to copy the text into a file, plus there are dependencies on Python and the grab library. What to do? One option is to write a browser extension so that you can select a piece of text on a page (similar, for example, to how it is implemented in Reedy) and get its translation. Second, lists of the most used words in other languages are needed for exclusion. Finally, various encoding issues are possible.

Most likely I will not get around to these improvements in the near future (the code is written, so it's time to get on with actually studying the language!), so if someone wants to join in, I will be glad of the company and the help.

The entire code can be found under the spoiler, as well as on github.

Source
# -*- coding: utf-8-sig -*-
from __future__ import print_function
import re
import json
from grab import Grab
from dictDe import *


def tranlsate(word, key, lan1='de', lan2='ru', alt=True, syn=True):
    """Prints the number of counts, word, translation, and example
    from lan1 to lan2 according to Translate.Google."""
    # First, write down the response in an auxiliary txt file
    # and load it in json format
    g = Grab(log_file = 'dict.txt')
    link = 'http://translate.google.ru/translate_a/t?client=x&text='\
           + word + '&sl=' + lan1 + '&tl=' + lan2
    g.go(link)
    data = json.load(open('dict.txt'))
    # Then, try to get all the necessary elements from the json
    translation, noun, alternatives, synonims = 0, 0, 0, 0
    try:
        translation = data[u'sentences'][0][u'trans']
        noun = data[u'dict'][0][u'pos']
        alternatives = data['dict'][0]['terms']
        synonims = data['dict'][0]['entry'][0]['reverse_translation']
    except (KeyError, IndexError, TypeError):
        pass
    # German nouns should begin with a capital letter
    if lan1=='de' and noun==u'имя существительное':
        word = word.title()
    # Finally, print out counts, word, and translation with alternatives
    # and synonims, if applicable. Encoding is added to allow
    # printing in cmd if you have a Russian version of Windows
    if translation:
        print ('['+str(key)+']', word, ': ', translation)
        if alt and alternatives:
            [print (i, end=', ') for i in alternatives]
            print ('\r')
        if syn and synonims:
            [print (i.encode('cp866', errors='replace'), end=', ')
                                     for i in synonims]
            print ('\n')


def word_count_dict(filename, dictList=de50):
    """Returns a dictionary with key being number of counts
    and value being a list of words with that key.
    dictList is an optional argument: it eliminates
    the most common words. Default is the dictionary of
    the 50 most common German words."""
    count = {}
    txt = open(filename, 'r').read().lower()
    txt = re.sub('[,.!?":;()*]', '', txt)
    words = txt.split()
    for word in words:
        if word not in count:
            count[word] = 1
        else:
            count[word] += 1
    return {i: sorted([w for w in count
                       if count[w] == i and w not in dictList.values()],
                      key=lambda x: txt.index(x))
            for i in set(count.values())}


def print_top(filename, n=100):
    """Prints the top n count groups for the given file.
    Default is n=100. Drop reverse=True if you want
    the less frequent words to come first."""
    mydict = word_count_dict(filename)
    mydict_keys = sorted(mydict, reverse=True)[0:n]
    [[tranlsate(word, key) for word in mydict[key]] for key in mydict_keys]


filename = 'dictext.txt'
print_top(filename)

