PHP: Detecting the language of a text using N-grams. Part 2
- Translation
This is the second part of Ian Barber's article on detecting the language of a text using PHP. The first part can be found here.
The article had to be split into two parts because of the sheer volume of formatted text (the editor kept failing with "Some error... We know...").
Unfortunately, I didn't have much material on my machine for building language models, but the OS X multilingual dictionaries came in handy. After stripping the XML tags with strip_tags, I got clean text:
<?php
$lang = new LangDetector();
$dir = "/Library/Dictionaries/Apple Dictionary.dictionary/Contents/Resources/";
$dutch = strip_tags(file_get_contents($dir . "Dutch.lproj/Body.data"));
$lang->adddocument($dutch, 'dutch');
$english = strip_tags(file_get_contents($dir . "English.lproj/Body.data"));
$lang->adddocument($english, 'english');
$finnish = strip_tags(file_get_contents($dir . "fi.lproj/Body.data"));
$lang->adddocument($finnish, 'finnish');
$spanish = strip_tags(file_get_contents($dir . "Spanish.lproj/Body.data"));
$lang->adddocument($spanish, 'spanish');
$italian = strip_tags(file_get_contents($dir . "Italian.lproj/Body.data"));
$lang->adddocument($italian, 'italian');
$french = strip_tags(file_get_contents($dir . "French.lproj/Body.data"));
$lang->adddocument($french, 'french');
$swedish = strip_tags(file_get_contents($dir . "sv.lproj/Body.data"));
$lang->adddocument($swedish, 'swedish');
?>
With the index built, we can now test a number of texts in various languages to verify the recognition accuracy. Many thanks to Lorenzo, Soila (who speaks an impressive number of languages) and Ivo for the examples they provided:
<?php
// $lang is the detector trained in the previous snippet
$italian = "
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", $lang->detect($italian), "\n";
$finnish = "
Suomalainen on sellainen, joka vastaa kun ei kysytä,
kysyy kun ei vastata, ei vastaa kun kysytään,
sellainen, joka eksyy tieltä, huutaa rannalla
ja vastarannalla huutaa toinen samanlainen.
";
echo $finnish, "\n", "is ", $lang->detect($finnish), "\n";
$dutch = "
zoals het klokje thuis tikt, tikt het nergens
";
echo $dutch, "\n", "is ", $lang->detect($dutch), "\n";
$spanish = "
Por qué los inmensos aviones
No se pasean con sus hijos?
Cuál es el pájaro amarillo
Que llena el nido de limones?
Por qué no enseñan a sacar
Miel del sol a los helicópteros?
";
echo $spanish, "\n", "is ", $lang->detect($spanish), "\n";
$swedish = "
Och knyttet tog av skorna och suckade och sa:
hur kan det kännas sorgesamt fast allting är så bra?
Men vem ska trösta knyttet med att säga: lilla vän,
vad gör man med en snäcka om man ej får visa den?
";
echo $swedish, "\n", "is ", $lang->detect($swedish), "\n";
?>
As you can easily see (the script's output is trimmed slightly for brevity), each language was identified correctly:

Nel mezzo del cammin...
is italian
Suomalainen on sellainen...
is finnish
zoals het klokje thuis tikt, tikt het nergens
is dutch
Por que los inmensos...
is spanish
Och knyttet tog av...
is swedish

The same can be done with entire websites by removing the HTML tags with strip_tags. The targets were the sites of three local Ibuildings offices:
<?php
// file_get_contents needs a full URL to fetch a remote page
$nl = strip_tags(file_get_contents('http://www.ibuildings.nl'));
echo "IB NL reads as: " . $lang->detect($nl), "\n";
$uk = strip_tags(file_get_contents('http://www.ibuildings.co.uk'));
echo "IB UK reads as: " . $lang->detect($uk), "\n";
$it = strip_tags(file_get_contents('http://www.ibuildings.it'));
echo "IB IT reads as: " . $lang->detect($it), "\n";
?>
It seems the page on the NL domain still contains more English text than Dutch, so it is identified as English. Italian, however, came out fine:
IB NL reads as: english
IB UK reads as: english
IB IT reads as: italian
Other methods
Although the trigram method is convenient and simple, it is by no means the best choice in every situation. For example, if you need a method that works without prior training, or with minimal memory, you can simply compile a list of short, frequently occurring words for each language (articles and prepositions, say) and look only for those in the text.
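As a rough illustration of that idea (this sketch is not from the original article, and the word lists are invented for the example), a short-word counter could look like this:

```php
<?php
// Sketch of the common-words approach: score each language by how many
// of its frequent short words appear in the text. The marker lists here
// are illustrative, not exhaustive.
function guessLanguage(string $text): string {
    $markers = [
        'english' => ['the', 'and', 'of', 'to', 'in', 'is'],
        'dutch'   => ['de', 'het', 'een', 'en', 'van', 'niet'],
        'italian' => ['il', 'la', 'di', 'che', 'e', 'non'],
        'spanish' => ['el', 'la', 'de', 'que', 'y', 'los'],
    ];
    $words = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $best = 'unknown';
    $bestScore = 0;
    foreach ($markers as $language => $list) {
        // array_intersect keeps every occurrence, so repeated hits count
        $score = count(array_intersect($words, $list));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $language;
        }
    }
    return $best;
}

echo guessLanguage('the cat sat on the mat and looked out of the window'), "\n"; // english
```

With a dozen marker words per language this needs no training corpus at all, at the cost of failing on very short or unusual texts.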
Similarly, searching for Unicode characters unique to a given language's script can give you enough accuracy to identify it.
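A minimal sketch of that script-spotting idea (again, not from the article), using PCRE's Unicode property classes; mapping a script to a single language is of course only a first approximation:

```php
<?php
// Sketch of script-based detection: a single character from a
// distinctive Unicode script narrows the language down immediately.
function guessScript(string $text): string {
    if (preg_match('/\p{Cyrillic}/u', $text)) return 'cyrillic';
    if (preg_match('/\p{Greek}/u', $text))    return 'greek';
    if (preg_match('/\p{Han}/u', $text))      return 'han';
    return 'latin or other';
}

echo guessScript('Привет, мир'), "\n"; // cyrillic
```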
PEAR: Text_LanguageDetect
When Lorenzo and I discussed this problem, he mentioned that the PEAR library already includes a package for detecting the language of a text, albeit in alpha. It also uses the trigram method, but with somewhat richer capabilities. As expected, it is easy to use, supports Unicode, and ships with ready-made trigram databases for a number of languages, so it needs almost no training. To complete the picture, we used it to identify the languages of the same text fragments:
<?php
require_once 'Text/LanguageDetect.php';

function detect($text, $l) {
    $result = $l->detect($text, 1);
    if (PEAR::isError($result)) {
        return $result->getMessage();
    }
    return key($result);
}
$l = new Text_LanguageDetect();
$italian = "
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", detect($italian, $l), "\n";
// ...the rest omitted for brevity, but the results are similar
?>
As expected, the results were similar. The package can easily be installed from PEAR with:
pear -d preferred_state=alpha install Text_LanguageDetect