luksian May 11, 2011 at 05:47

Automatic text analysis without moderators

From the sandbox

Recently there was an article on Habr about automatic summarization of articles. As it happens, I also work on automatic text analysis and have had some success with it.

I have got the algorithm to find duplicate texts and texts with similar content. It also automatically determines how close a text is to particular topics and picks out of the total mass the texts that make up the mainstream. That is, the reader does not have to sift through everything to understand the main points. As the volume of analyzed texts grows, low-quality, uninteresting, obscene, and irrelevant texts are filtered out automatically.

The idea of the algorithm is that each text is split into chains, the chains are compared with one another, and special markers are selected; decisions are then made based on those markers.
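The post does not say what a "chain" is, so the following is only a minimal sketch under one assumption: a chain is a run of consecutive words (a "shingle"), and two texts are compared by the Jaccard overlap of their chain sets. The names `shingles` and `similarity` and the chain length of 3 are hypothetical and are not taken from the site's PHP code.

```python
import re


def shingles(text, size=3):
    """Split text into overlapping chains of `size` consecutive words."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Short texts yield a single chain made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of the two chain sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Two headlines that differ only in the last word share 3 of 5 distinct chains.
print(similarity("the cat sat on the mat", "the cat sat on the rug"))  # 0.6
```

With a score like this, duplicates would be pairs close to 1.0 and "similar" texts would be pairs above some threshold chosen by experiment.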

The analysis is fully automatic, with no moderator or editor-in-chief. Because of this, the algorithm sometimes makes mistakes and may place a text in the wrong section, although the more likely cause is that the original set of texts was grouped with even less care. The algorithm becomes more accurate over time as enough statistical information accumulates.

That is not all. The algorithm can understand humor. If a text stands out from the general mass and is full of absurdities, the algorithm picks it out and marks it as "Humor". The algorithm finds jokes fairly well, and if something it marks is not funny, the likely reason is that the algorithm has been running for only a few weeks. That is, it has not yet had time to learn that something is no longer funny.
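How the algorithm decides that a text "stands out from the general mass" is not described. As a rough illustration only, one could score each text by the share of its words that appear in no other text; a high score means the text shares little vocabulary with the rest. The function name `novelty_scores` is hypothetical, and this sketch does not attempt to detect absurdity or humor as such.

```python
import re
from collections import Counter


def distinct_words(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"\w+", text.lower()))


def novelty_scores(texts):
    """For each text, the share of its distinct words found in no other text."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(distinct_words(text))
    scores = []
    for text in texts:
        ws = distinct_words(text)
        rare = sum(1 for w in ws if doc_freq[w] == 1)
        scores.append(rare / len(ws) if ws else 0.0)
    return scores


texts = ["rates rise again", "rates rise sharply", "penguin steals tram"]
print(novelty_scores(texts))  # the third text scores highest
```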

New ideas can also be found automatically. For example, in the city of Kopeisk the maternity hospital was connected to the Internet so that fathers would not stand under the windows calling to their wives and trying to see their child's face from afar, but could look at the baby over Skype. Or: visitors to the Yalta resort are advised to carry a whistle and arm themselves with gas spray canisters, because the season of thefts and robberies has opened in the city. And Poland will advertise its apples in Russia with EU money.

The ideas are not always interesting, but this too should improve as "experience" accumulates. Those who wish can look for a suitable idea among those already found.

The current algorithm can be seen at work on the nfos.ru website. The site collects news from several sources, analyzes it, and publishes whatever it considers worth publishing. Now I can boast that I know all the main news without any effort. I wish you the same.

For example, did you already know that a case has been opened against Navalny on suspicion of raiding? Or have you heard that a record number of Osama portraits have been sold in Pakistan?

I think the algorithm is suitable not only for analyzing texts, but also for analyzing images and other unstructured data or data whose structure is not obvious. For example, it could be used to filter out noise, for decryption, to analyze algorithms from the results of their work, and so on. The algorithm is potentially suitable for predicting stock quotes and exchange rates, but I am unlikely to get to all of this in the near future, as there is not enough time.

The entire analysis algorithm fits in 40 kilobytes of PHP code, plus about 70 kilobytes of code for presenting the news site. You will agree that for the resulting functionality this is tiny. What the algorithm is really greedy for is storage space. In several weeks, more than 1.5 GB of information has accumulated in the database, and this volume keeps growing.

The algorithm is practically insensitive to failures. If inaccurate, distorted, false, or otherwise bad information gets into the database at some point, it will either not affect further analysis or its effect will become negligible over time.
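The post does not explain why bad data fades out. One common way to get this property, shown here purely as an assumption, is to let every accumulated statistic decay by a fixed factor at each update, so a one-off bad observation shrinks geometrically while markers that keep appearing stay strong. The class name `DecayingCounter` and the decay factor are hypothetical.

```python
class DecayingCounter:
    """Counts whose older contributions fade by a fixed factor each step."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.weights = {}

    def step(self, observed):
        """Fade all existing weights, then add 1.0 for each observed marker."""
        for key in self.weights:
            self.weights[key] *= self.decay
        for key in observed:
            self.weights[key] = self.weights.get(key, 0.0) + 1.0


counter = DecayingCounter(decay=0.5)
counter.step(["bad"])          # a single piece of bad information
for _ in range(10):
    counter.step(["good"])     # followed by ten regular updates
print(counter.weights)         # "bad" is now far smaller than "good"
```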

Finally, I want to say that the analysis did not require powerful hosting. All news from about 150 sources is analyzed on the cheap FirstVDS-Acceleration hosting plan, which costs 249 rubles per month. CPU time is, of course, in short supply, but that made me optimize the algorithm, which I managed to do without visible losses.
