b0noII November 11, 2012 at 00:50

Text Mining Framework (Java)

What it is and who it is for (in lieu of an introduction)

In this article I would like to talk about the modest results of my research in the field of text mining. Those “results” have taken the form of a small framework which, so far, is not very good, but we are growing =). The project puts into practice some of the theoretical ideas I have developed. For that reason, I describe here the capabilities it could have once all of those ideas are implemented. The creation is called “Text Mining FrameWork” (TextMF). Let's briefly go over what TextMF should offer in its first final version and what works right now.

Planned for the final version:

Statistical text analysis;
Search for all words and the word forms of each word in the text;
Ranking of words by their weight in the text;
Search for the subjects (persons) mentioned in the text;
Links between those subjects (direct and indirect);
Text summarization;
Detection of the topic of the text;
Language learning;
User interaction through conversation (chat).

Already implemented (partially available or undergoing testing):

Statistical analysis of the text (only partially implemented so far);
Search for all words and word forms in the text;
Ranking of words by their weight in a given text;
Search for persons in the text;
Detection of the topic of the text (the formulas are still being tested and tuned).
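To make "ranking words by weight" more concrete, here is a minimal sketch in plain Java. It is not TextMF code, and the article does not describe the framework's actual weight formula; the sketch simply assumes that a word's weight is its share of all word occurrences in the text. The class and method names (WordRankingSketch, countWords, weight, rank) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: this is NOT TextMF code.
class WordRankingSketch {

    // Counts occurrences of each lower-cased word in the text.
    static Map<String, Long> countWords(String text) {
        Map<String, Long> counts = new HashMap<>();
        // Split on every run of characters that are not letters (Unicode-aware).
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Assumed weight: the share of all word occurrences taken by this word.
    static double weight(Map<String, Long> counts, String word) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) counts.getOrDefault(word, 0L) / total;
    }

    // Returns the words ordered by descending count, ties broken alphabetically.
    static List<String> rank(Map<String, Long> counts) {
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int byCount = Long.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return words;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords("The cat saw the other cat. The end.");
        for (String w : rank(counts)) {
            System.out.println(w + " count=" + counts.get(w) + " weight=" + weight(counts, w));
        }
    }
}
```

A pure frequency count like this puts function words such as "the" at the top, which is one reason a real framework needs heuristics beyond simple statistics.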

Why another text processing library?

The goal of this project is not to create a tool with which you can implement any kind of text processing algorithm (like Python's NLTK and similar libraries), but to let you use ready-made algorithms and, at the same time, test your own algorithm in practice. That is, this is not yet another statistical analyzer or a set of containers optimized for working with text data. No! It is a set of heuristics that should work out of the box without requiring additional knowledge.

What inputs does TextMF work with? So far, only text files. Needless to say, support for many more input formats is planned. Web integration is also planned, so that Web pages can be analyzed just as easily.

Where to get it

The project is distributed through the BitBucket repository.

Clone it and add it to your project =) Everything is extremely simple. Builds in the form of a pluggable jar will be available soon.

Usage example

Text processing often takes a lot of time, especially if you try to open a whole book! So, to try it out, I highly recommend limiting yourself to texts of a few pages taken from websites. Very small texts, however, can also give poor results because they contain too little information.

As mentioned earlier, the main idea is to maximize ease of use and to hide the heuristics and algorithms. So everything is trivial:

// Open and parse the text file located at TEXT_FILE_NAME
Text text = new Text(TEXT_FILE_NAME);
// Get the list of words ranked by weight
// (element type Word is implied by words.get(0) below)
List<Word> words = text.getWords();
// Get the topic of the text
List theme = text.getThem();
// Get the first word in the list
Word word = words.get(0);
// Get the list of all word forms of that word
List wordForms = word.getWordForms();
// Get the number of occurrences of the word in the text
long count = word.getCount();
// Get all persons that appear in the text
List objects = text.getObjects();
// Look up the weight of the word
double weight = text.getWordWeight(word);

I repeat: getting the topic is a rather long procedure, so be careful when calling this method ;) An asynchronous method for getting the topic will of course be implemented, but later. It is also VERY important to note that the quality of the results grows with the size of the submitted text: as a rule, the more information there is, the more opportunity there is to learn the language. However, the time needed to open a file also increases significantly as the content grows.
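Until an asynchronous method exists in the framework, a caller can keep its own thread responsive by wrapping the slow call. Below is a minimal sketch using the standard CompletableFuture class. It is an illustration only: the class AsyncTopicSketch, the helper async, and the stand-in slowTopicLookup (which merely sleeps and returns made-up words) are hypothetical and not part of TextMF.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical illustration only: TextMF does not (yet) ship an async API.
class AsyncTopicSketch {

    // Runs a slow, blocking call on a background thread
    // (the common ForkJoinPool by default) and returns immediately.
    static <T> CompletableFuture<T> async(Supplier<T> slowCall) {
        return CompletableFuture.supplyAsync(slowCall);
    }

    // Stand-in for a slow call such as text.getThem(); the result is made up.
    static List<String> slowTopicLookup() {
        try {
            Thread.sleep(200); // simulate a long computation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Arrays.asList("example", "topic");
    }

    public static void main(String[] args) {
        CompletableFuture<List<String>> topic = async(AsyncTopicSketch::slowTopicLookup);
        System.out.println("Topic lookup started; the caller is not blocked.");
        // join() waits for the result only at the point where it is needed.
        System.out.println("Topic: " + topic.join());
    }
}
```

With a real Text instance, the supplier passed to async would presumably be a lambda that calls text.getThem(); whether Text is safe to use from another thread is not stated in the article, so treat that as an open question.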

Small UI program

To illustrate some of the framework's features, my colleague Andrei whipped up a small UI client. At this stage it is for demonstration only, although it is sometimes more convenient to use. It is written in JavaFX and is not yet distributed as a separate jar file. To try it out, you have to build it yourself =(.

The main program window:

1) Menu for selecting text for processing;
2) List of selected files;
3) The results of the work:
a) the word found in the text;
b) the weight of the word in the text;
c) the number of repetitions in the text;
4) A field for displaying the subject of the text;
5) List of word forms.

Let's see what our program can tell us about this text: “Owners of Volga and Moskvich cars will be given another year”:

The topic search took about a minute (long, I agree). When you select a single word, you can see its word forms:

or here:

Now let's try another text: “The aliens have abducted a family of Ukrainians and told about the future of earthlings!”, probably one of the most “yellow” texts =):

The text took a long time to open, probably a minute, and the topic search took about as long. Of course, the topic of a text should be understood here as the chain of words that the algorithm considers to be the topic. In the future the algorithm will be able to produce output in a readable form, but that is the future, and for now...

We need your help

Of course, we really, really need your help! There are a lot of tasks, but the project is free. The tasks range from the simplest (setting up the site, writing examples, documenting code) to the most hardcore (helping to optimize and refine the mathematical apparatus). Right now, for example, it would be great if someone took on expanding the input formats beyond plain text files. Help with testing is also very important. The project has a domain, www.textmf.com, but it is empty, and I would be very happy if someone helped to fix that =)

For any offers of cooperation, please write to: Viacheslav@b0noI.com

Immediate plans

What will happen with the project in the near future (I think within a month or two):

jar file builds will be added;
the project will be split into a core and a UI, i.e. one more repository will be added;
implementation of long-term memory will begin;
analysis of relationships between persons will be added;
it will be possible to summarize text;
a self-contained jar with the UI will be created.

Distant plans

TextMF is now a semi-finalist of the www.ukrinnovation.com project. So there is a chance, albeit a small one, of getting investment for development.

I know that so far these are dreams, but if I were asked what functionality I see at the end, I would answer: a library with which you can write a chat bot that will pass the Turing test. To be more realistic, it will most likely be engines for dynamically tracking information on the Internet: following links and monitoring their changes. And, of course, something for building local search engines.

The idea itself has great potential: spam filters, search engines, automatic summarization systems, and many, many other things can be built on top of such a framework.

TextMF authors:
Your humble servant Vyacheslav V Kovalevsky and
UI developer Andrey Prischepa (vinglfm@gmail.com)
