PHP: Detecting the language of a text using N-grams. Part 2
- Translation
This is the second part of Ian Barber's article on detecting the language of a text using PHP. The first part can be found here.
The article had to be split into two parts because of the sheer volume of formatted text (the editor kept failing with "Some error... We know...").
Unfortunately, I didn't have much material on my machine for building language models, but the OS X multilingual dictionaries came in handy. After stripping the XML tags with strip_tags, I got clean text:
<?php
$lang = new LangDetector();
$dir = "/Library/Dictionaries/Apple Dictionary.dictionary/Contents/Resources/";
$dutch = strip_tags(file_get_contents($dir . "Dutch.lproj/Body.data"));
$lang->adddocument($dutch, 'dutch');
$english = strip_tags(file_get_contents($dir . "English.lproj/Body.data"));
$lang->adddocument($english, 'english');
$finnish = strip_tags(file_get_contents($dir . "fi.lproj/Body.data"));
$lang->adddocument($finnish, 'finnish');
$spanish = strip_tags(file_get_contents($dir . "Spanish.lproj/Body.data"));
$lang->adddocument($spanish, 'spanish');
$italian = strip_tags(file_get_contents($dir . "Italian.lproj/Body.data"));
$lang->adddocument($italian, 'italian');
$french = strip_tags(file_get_contents($dir . "French.lproj/Body.data"));
$lang->adddocument($french, 'french');
$swedish = strip_tags(file_get_contents($dir . "sv.lproj/Body.data"));
$lang->adddocument($swedish, 'swedish');
?>
With the index built, we can now test a number of texts in various languages to verify the recognition accuracy. Many thanks to Lorenzo, Soila (who speaks an impressive number of languages) and Ivo for the examples they provided:
<?php
// $lang is the detector trained in the previous snippet
$italian = "
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", $lang->detect($italian), "\n";
$finnish = "
Suomalainen on sellainen, joka vastaa kun ei kysytä,
kysyy kun ei vastata, ei vastaa kun kysytään,
sellainen, joka eksyy tieltä, huutaa rannalla
ja vastarannalla huutaa toinen samanlainen.
";
echo $finnish, "\n", "is ", $lang->detect($finnish), "\n";
$dutch = "
zoals het klokje thuis tikt, tikt het nergens
";
echo $dutch, "\n", "is ", $lang->detect($dutch), "\n";
$spanish = "
Por qué los inmensos aviones
No se pasean con sus hijos?
Cuál es el pájaro amarillo
Que llena el nido de limones?
Por qué no enseñan a sacar
Miel del sol a los helicópteros?
";
echo $spanish, "\n", "is ", $lang->detect($spanish), "\n";
$swedish = "
Och knyttet tog av skorna och suckade och sa:
hur kan det kännas sorgesamt fast allting är så bra?
Men vem ska trösta knyttet med att säga: lilla vän,
vad gör man med en snäcka om man ej får visa den?
";
echo $swedish, "\n", "is ", $lang->detect($swedish), "\n";
?>
As you can easily see (the script's output is trimmed slightly for brevity), each language was identified correctly:

Nel mezzo del cammin...
is italian
Suomalainen on sellainen...
is finnish
zoals het klokje thuis tikt, tikt het nergens
is dutch
Por que los inmensos...
is spanish
Och knyttet tog av...
is swedish

The same can be done with entire websites by removing the HTML tags with strip_tags. The targets were the sites of three local Ibuildings offices:
<?php
// file_get_contents needs a full URL to fetch a remote page
$nl = strip_tags(file_get_contents('http://www.ibuildings.nl'));
echo "IB NL reads as: " . $lang->detect($nl), "\n";
$uk = strip_tags(file_get_contents('http://www.ibuildings.co.uk'));
echo "IB UK reads as: " . $lang->detect($uk), "\n";
$it = strip_tags(file_get_contents('http://www.ibuildings.it'));
echo "IB IT reads as: " . $lang->detect($it), "\n";
?>
It seems the page on the NL domain still contains more English text than Dutch, so it is identified as English. Italian, however, came out fine:
IB NL reads as: english
IB UK reads as: english
IB IT reads as: italian
Other methods
Although the trigram method is convenient and simple, it is by no means the best choice in every situation. For example, if you need a method that works without prior training, or with minimal memory, you can simply compile a list of short, frequently occurring words for each language (articles and prepositions, say) and look only for those in the text.
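As a rough illustration of that idea (this sketch is not from the original article, and the word lists are invented for the example), a short-word counter could look like this:

```php
<?php
// Sketch of the common-words approach: score each language by how many
// of its frequent short words appear in the text. The marker lists here
// are illustrative, not exhaustive.
function guessLanguage(string $text): string {
    $markers = [
        'english' => ['the', 'and', 'of', 'to', 'in', 'is'],
        'dutch'   => ['de', 'het', 'een', 'en', 'van', 'niet'],
        'italian' => ['il', 'la', 'di', 'che', 'e', 'non'],
        'spanish' => ['el', 'la', 'de', 'que', 'y', 'los'],
    ];
    $words = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $best = 'unknown';
    $bestScore = 0;
    foreach ($markers as $language => $list) {
        // array_intersect keeps every occurrence, so repeated hits count
        $score = count(array_intersect($words, $list));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $language;
        }
    }
    return $best;
}

echo guessLanguage('the cat sat on the mat and looked out of the window'), "\n"; // english
```

With a dozen marker words per language this needs no training corpus at all, at the cost of failing on very short or unusual texts.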
Similarly, searching for Unicode characters unique to a given language's script can give you enough accuracy to identify it.
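A minimal sketch of that script-spotting idea (again, not from the article), using PCRE's Unicode property classes; mapping a script to a single language is of course only a first approximation:

```php
<?php
// Sketch of script-based detection: a single character from a
// distinctive Unicode script narrows the language down immediately.
function guessScript(string $text): string {
    if (preg_match('/\p{Cyrillic}/u', $text)) return 'cyrillic';
    if (preg_match('/\p{Greek}/u', $text))    return 'greek';
    if (preg_match('/\p{Han}/u', $text))      return 'han';
    return 'latin or other';
}

echo guessScript('Привет, мир'), "\n"; // cyrillic
```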
PEAR: Text_LanguageDetect
When Lorenzo and I discussed this problem, he mentioned that the PEAR library already includes a package for detecting the language of a text, albeit in alpha. It also uses the trigram method, but with somewhat richer capabilities. As expected, it is easy to use, supports Unicode, and ships with ready-made trigram databases for a number of languages, so it needs almost no training. To complete the picture, we used it to identify the languages of the same text fragments:
<?php
require_once 'Text/LanguageDetect.php';

function detect($text, $l) {
    $result = $l->detect($text, 1);
    if (PEAR::isError($result)) {
        return $result->getMessage();
    }
    return key($result);
}
$l = new Text_LanguageDetect();
$italian = "
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", detect($italian, $l), "\n";
// ...the rest omitted for brevity, but the results are similar
?>
As expected, the results were similar. The package can easily be installed from PEAR with:
pear -d preferred_state=alpha install Text_LanguageDetect