A New Language-Independent NLP Library
Introduction
Everyone who came into this world traveled the path of learning a language. Yet a person does not learn a language through rules or grammar. Moreover, each person, as a child, first masters the strange phenomenon that is language, and only later, with age, begins to study its rules (in kindergarten and at school). This explains the amusing fact that an adult studying a foreign language, at an age when picking up new languages comes harder, often knows more about the subject of study than most native speakers of that language.
This simple observation suggests that understanding a language does not require prior knowledge of it. Experience is enough, and experience can be gathered from others. Yet this is exactly what almost all modern NLP libraries forget, trying instead to build an all-encompassing model of the language.
For a clearer picture, imagine yourself blind and deaf. Even if you had been born in that state, you could still interact with the world and learn a language. Naturally, your idea of the world would differ from everyone else's, yet you could all interact with the world in the same way. With no one able to explain to you what was happening or what language even was, simply by tactilely analyzing Braille you would gradually have gotten off the ground.
And this means that to understand a message in any language, we need nothing but the message itself, provided the message is large enough. It is this idea that underlies the library called AIF.
First, a little theory on why the current state of affairs is dreary
There is a very good Stanford NLP course: www.coursera.org/course/nlp . If for some reason you have not watched it, that is a real pity. After at least the first two weeks of material, it becomes clear what the probabilistic language model, on which most existing NLP solutions are built, actually is. In short, given a huge pile of texts, you can estimate the probability with which each word is used alongside another word. This is a very crude explanation, but it seems to me to reflect the essence accurately. As a result, it is possible to build more or less decent translations (hello, Google Translate). This approach does not bring us closer to understanding the text; it merely tries to find similar sentences and build a translation on their basis.
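The co-occurrence idea behind such models can be sketched as follows. This is a toy illustration of the general n-gram approach taught in that course, not code from any of the libraries mentioned:

```java
// Toy illustration of the n-gram idea: estimate P(second | first)
// from raw co-occurrence counts in a corpus.
public class BigramSketch {

    // Counts how often `second` immediately follows `first`, divided by
    // how often `first` occurs at all: a maximum-likelihood bigram estimate.
    public static double probability(String[] corpus, String first, String second) {
        int firstCount = 0;
        int pairCount = 0;
        for (int i = 0; i < corpus.length - 1; i++) {
            if (corpus[i].equals(first)) {
                firstCount++;
                if (corpus[i + 1].equals(second)) pairCount++;
            }
        }
        return firstCount == 0 ? 0.0 : (double) pairCount / firstCount;
    }

    public static void main(String[] args) {
        String[] corpus = {"the", "cat", "sat", "on", "the", "mat"};
        // "the" occurs twice, once followed by "cat":
        System.out.println(probability(corpus, "the", "cat")); // 0.5
    }
}
```

Real systems work with vastly larger corpora and apply smoothing, but the core statistic is this kind of conditional frequency.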
But let us not dwell on sad things; let us talk about what we can potentially offer.
What functions should the final version of our library implement?
- Searching for the characters used to separate sentences in a text.
- Extracting lemmas from a text (with weights).
- Building a semantic graph of a text.
- Comparing the semantic graphs of texts.
- Building a summary of a text.
- Extracting objects from a text (partial NER).
- Determining the relationships between objects.
- Determining the topic of a text.
The current version already implements some items from this list.
Why does the world need AIF?
Given that quite a few such libraries already exist (OpenNLP, StanfordNLP, ...), why create another one?
Most existing NLP libraries have significant drawbacks:
- dependence on specific languages (the quality of the results can vary greatly from language to language);
- dependence on a strict grammatical structure (it would be nice if everyone wrote like Shakespeare or Tolstoy, but that is far from reality);
- dependence on the encoding (language models are often tuned to a particular encoding).
In such libraries there is a very strong correlation between the quality of the input text and the quality of the output.
Language models cannot perform semantic analysis of a text; they sidestep understanding at the parsing stage. A language model can help split text into sentences and support entity extraction (NER) or sentiment extraction. Nevertheless, such a model cannot determine the meaning of the text; for example, it cannot compose an acceptable summary.
Let us illustrate these points with an example.
Take the scanned text https://archive.org/details/legendaryhistor00veld . It already contains a number of characters with non-standard encoding, but we will make things even worse by replacing the "." character with "¸". This replacement barely affects readability for a human, but it makes the text practically unparseable for NLP libraries.
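The corruption itself is trivial to perform. This hypothetical snippet (not part of AIF) shows the kind of substitution applied before feeding the text to each library:

```java
// Demonstrates the corruption described above: swapping "." for "¸"
// keeps the text readable to a human but breaks period-based
// sentence-splitting heuristics.
public class CorruptText {

    public static String corrupt(String text) {
        return text.replace('.', '¸');
    }

    public static void main(String[] args) {
        String original = "First sentence. Second sentence.";
        System.out.println(corrupt(original));
        // First sentence¸ Second sentence¸
    }
}
```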
Let us try to split this text into sentences using OpenNLP, StanfordNLP, and AIF.
As a result, the libraries detected the following numbers of sentences:
- StanfordNLP: 13
- OpenNLP: 3
- AIF: 2240
But even simpler problems than this often defeat most NLP libraries. The main reason is that they are not particularly smart: they rely on models that are essentially sets of static rules and values. Changing those rules or values usually requires retraining the model, which is slow and costly. Avoiding all of this, by avoiding language models, is the fundamental idea behind our library.
AIF learns the language from the input text. It needs no language models, since it extracts all the necessary information about the language from the text itself. The only important requirement is that the input text contain more than 20 sentences.
So how does AIF break text into sentences?
To find the characters that divide a text into sentences, we developed a special formula: for each character, it computes the probability that the character is a separator.
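AIF's actual formula is not published in this article, so as a purely hypothetical stand-in, here is one naive way to score a character: the fraction of its occurrences that fall at the end of a whitespace-delimited token.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AIF's separator score (the real formula is
// not given in this article): a character that almost always appears at
// the end of a token is a plausible sentence/clause separator.
public class SeparatorScore {

    public static Map<Character, Double> score(String text) {
        Map<Character, Integer> total = new HashMap<>();
        Map<Character, Integer> atTokenEnd = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            for (int i = 0; i < token.length(); i++) {
                char c = token.charAt(i);
                total.merge(c, 1, Integer::sum);
                if (i == token.length() - 1) atTokenEnd.merge(c, 1, Integer::sum);
            }
        }
        Map<Character, Double> result = new HashMap<>();
        for (Map.Entry<Character, Integer> e : total.entrySet()) {
            result.put(e.getKey(),
                       atTokenEnd.getOrDefault(e.getKey(), 0) / (double) e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Character, Double> scores = score("Dogs bark. Cats meow. Birds sing.");
        System.out.println(scores.get('.')); // 1.0 - '.' only ever ends tokens
        System.out.println(scores.get('a')); // 0.0 - 'a' never ends a token
    }
}
```

Note that this naive score also ranks frequent word-final letters (such as the English 's') highly, which is one reason a real formula has to be more subtle than this sketch.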
The computed probabilities that each character is used to separate sentences are given below.
Example 1 (The Legendary History of the Cross)
archive.org/details/legendaryhistor00veld
This chart displays the characters that are most likely to be used to separate sentences.
Example 2 (Punch, Or the London Charivari, Volume 107, December 8th, 1894)
www.gutenberg.org/ebooks/46816
This chart displays the characters that are most likely to be used to separate sentences.
Example 3 (William S. Burroughs, Naked Lunch)
en.wikipedia.org/wiki/Naked_Lunch
This chart displays the characters that are most likely to be used to separate sentences.
Of course, these probabilities alone do not give the final result. One still needs to find the boundary that divides the characters into "separators" and "other characters". One also needs to divide the separators themselves into groups: those that split the text into sentences and those that split a sentence into parts.
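One simple way to draw such a boundary (a sketch of a generic technique, not necessarily what AIF does) is to sort the scores and cut at the widest gap between neighboring values:

```java
import java.util.Arrays;

// Hypothetical boundary-finding sketch (not necessarily AIF's method):
// sort the per-character scores and place the threshold in the middle
// of the largest gap between neighboring values.
public class GapThreshold {

    public static double threshold(double[] scores) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        double best = sorted[0];
        double bestGap = -1;
        for (int i = 0; i < sorted.length - 1; i++) {
            double gap = sorted[i + 1] - sorted[i];
            if (gap > bestGap) {
                bestGap = gap;
                best = (sorted[i] + sorted[i + 1]) / 2.0;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Imagine these are scores for '.', '!', 'a', 'b': the wide gap
        // separates punctuation-like characters from ordinary letters.
        double[] scores = {1.0, 0.875, 0.375, 0.0};
        System.out.println(threshold(scores)); // 0.625
    }
}
```

The same idea applied recursively to the "separator" side could split the separators into subgroups, though again this is only an illustration of one possible approach.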
These results can easily be reproduced with the CLI built on top of our library.
The simplest CLI for AIF
- GitHub link: github.com/b0noI/aif-cli/wiki
- For download: s3.amazonaws.com/aif2/aif-cli/1.0/aif-cli.jar
You can use it as follows:
java -jar aif-cli.jar
For example, you can divide your text into sentences using the command:
java -jar aif-cli.jar --ssplit
Or tokens:
java -jar aif-cli.jar --tsplit
Or you can print the characters most likely to be sentence separators:
java -jar aif-cli.jar --ess
Using the AIF Library
You can already use the Alpha 1 version of our library in your project. To do so, simply add our Maven repository to the project. Instructions can be found here: github.com/b0noI/AIF2/wiki
At the moment, only two functions are available:
- splitting text into tokens (description);
- grouping tokens into sentences (description).
What is planned in the next version?
In the first Alpha, we do not yet divide the sentence-separator characters into groups, for example:
- Group 1: . ! ?
- Group 2: " ; ' ( )
- Group 3: , :
For now, we treat all "separators" as if they belonged to group 1. Nevertheless, starting with Alpha 2 there will be a division into groups (that is right: our library can subdivide separator characters without any language model!)
Also in Alpha 2 we will introduce a lemmatization module that extracts lemmas from text. Once again, this module will work completely independently of the language! AIF will be able to extract lemmas such as:
car, cars, car's, cars' => car
Since semantic analysis WILL NOT be implemented in Alpha 2, we will not yet be able to derive lemmas like:
am, are, is => be
But even this task can be solved in a language-independent way, and it will be solved in future releases.
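AIF's language-independent lemmatization algorithm is not described in this article. As a naive illustration only, word forms sharing a sufficiently long common prefix can be grouped, with the shortest form taken as the lemma; this handles "cars => car" but, as noted above, not "am, are, is => be":

```java
import java.util.Arrays;
import java.util.List;

// Naive illustration only (not AIF's algorithm): treat word forms that
// share a sufficiently long common prefix as inflections of one lemma,
// and take the shortest such form as the lemma.
public class PrefixLemmas {

    private static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Returns the shortest form sharing a prefix of at least `minPrefix`
    // characters with `word` - a crude lemma guess.
    public static String lemma(String word, List<String> vocabulary, int minPrefix) {
        String best = word;
        for (String candidate : vocabulary) {
            if (commonPrefix(word, candidate) >= minPrefix
                    && candidate.length() < best.length()) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("car", "cars", "car's", "cars'");
        System.out.println(lemma("cars'", vocab, 3)); // car
    }
}
```

Suppletive forms like "am / are / is" share no prefix, which is precisely why they require the semantic machinery of a later release.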
What is planned in the next article?
- comparative analysis of the quality of the breakdown into sentences with other key libraries;
- a description of the algorithm for selecting characters that break the text into sentences;
- a description of the algorithm for dividing characters into groups (those that split the text into sentences and those that split the sentences themselves).
Afterword
Of course, the current implementation does not work equally well with all languages. For example, texts in Japanese, or in other languages that do not use spaces, are still beyond AIF's reach.
Our Team
Kovalevskyi Viacheslav - algorithm developer, architecture design, team lead (viacheslav@b0noi.com / @b0noi)
Ifthikhan Nazeem - algorithm designer, architecture design, developer
Evgeniy Dolgikh (marcon@atsy.org.ua) - QA assistance, junior developer
Siarhei Varachai - QA assistance, junior developer
Balenko Aleksey (podorozhnick@gmail.com) - worked on Sentence Splitters for tests (using Stanford NLP and AIF NLP), added tokenization support for CLI, junior developer
Sviatoslav Glushchenko - REST design and implementation, developer
Oleg Kozlovskyi - QA (integration and quality testing), developer.
If you have an interesting NLP project, contact us;)
Project Links and Details
- project language: JDK8
- license: MIT license
- issue tracker: github.com/b0noI/AIF2/issues
- wiki: github.com/b0noI/AIF2/wiki
- source code: github.com/b0noI/AIF2
- developers mail list: aif2-dev@yahoogroups.com (subscribe: aif2-dev-subscribe@yahoogroups.com)
Afterword ^ 2
Honestly, the library is not entirely new. At the start of my PhD studies, I had already published some of these algorithms in raw form and even wrote an article about them on Habr. Since then, however, much water has flowed under the bridge: many hypotheses have been confirmed, and many rejected. The time had come to write a new implementation embodying the accumulated and tested NLP hypotheses.
This time we managed to attract more developers to the project, and we are trying to approach development more consistently than last time. It has also turned out to be a very good project in which students of my Java course on Hackslet can gain real experience of developing a Java project in a team ;)