An Automatic Summarization System for Three Languages

I want to talk about a service I developed for summarizing news texts in English, Russian, and German.

Automatic text summarization (ATS) is a rather niche topic and will mainly interest those who work in natural language processing. Still, a well-executed summarizer could be a useful assistant wherever one needs to cope with information overload and quickly decide which information deserves further attention.

What is the situation?


On the one hand, while searching for analogues I noticed an interesting thing: most of the articles, services, and repositories I found date back to 2012 at the latest. There is an article on Habr about automatic summarization published in 2011. That same year was the last time news summarization appeared on the TAC conference track list.

On the other hand, mobile applications that process news feeds and give the user short digests on topics of their choice are gaining popularity. A vivid example of this demand is the relatively recent (2013) acquisition of the startups Wavii and Summly by Google and Yahoo, respectively, as well as the various browser plugins that summarize web pages (Chrome, Mozilla).

A quick test of the free online summarization services shows that most of them work similarly, yielding the same mediocre results; Autosummarizer, perhaps, stands out for the better.


Why another summarizer?


The initial goal of the project was to serve as a platform for learning programming in general and Python in particular. Since computational linguistics is close to my line of work, I chose summarization as the object of development; besides, I already had some ideas and materials for it.

If you look at the services in the list above, you will see that they mostly work with English texts (when they can be made to work at all). A different language can be chosen in MEAD, OTS, Essential-mining, Aylien and SweSum. At the same time, the first has no web interface, the third requires registration after 10 trial texts, and the fourth, although it lets you adjust the settings in the demo, for some reason refuses to summarize anything.

Since I had already achieved decent results with English texts, I wanted to build a service that would handle Russian and German news articles no worse than the systems listed above, and that would also make it possible to compare my algorithm with methods popular today such as TextRank, LexRank and LSA. Besides, it was a good opportunity to practice HTML, CSS and WSGI.

Where to look?


Project site: t-CONSPECTUS

How does it work?


t-CONSPECTUS is an extractive summarizer, i.e. it builds the summary from those sentences of the original article that received the greatest weight during analysis and therefore best convey its content.

The whole summarization process is carried out in four stages: text preprocessing, term weighting, sentence weighting, and extraction of significant sentences.

During preprocessing, the text is split into paragraphs and sentences, the headline is located (it is needed to adjust term weights), and tokenization and stemming are performed. For Russian and German, lemmatization is carried out as well. pymorphy2 lemmatizes Russian texts; to process German I had to write my own lemmatization function based on the lexicon of the CDG parser, since neither NLTK, nor Pattern, nor the German TextBlob, nor FreeLing provided the necessary functionality, and the chosen hosting does not support Java, which ruled out Stanford NLP.
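As a rough illustration (not the project's actual code), the preprocessing stage can be imitated with the standard library alone. The regex-based sentence splitter and tokenizer below are deliberately naive stand-ins for a real splitter, and the stemming/lemmatization step (pymorphy2, the CDG lexicon) is omitted entirely:

```python
import re

def preprocess(text):
    """Naive preprocessing sketch: split text into paragraphs and
    sentences, then lowercase and tokenize the words. A real system
    would also stem or lemmatize each token per language."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    sentences = []
    for p in paragraphs:
        # crude sentence boundary: ., ! or ? followed by whitespace
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", p) if s)
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    return paragraphs, sentences, tokens

paras, sents, toks = preprocess("First sentence. Second one!\n\nNew paragraph here.")
print(len(paras), len(sents))  # 2 paragraphs, 3 sentences
```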

At the term weighting stage, keywords are determined using TF-IDF. A term receives an additional coefficient if it:

  1. occurs in the title,
  2. occurs in the first or last sentence of a paragraph,
  3. occurs in an exclamatory or interrogative sentence,
  4. is a proper noun.
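The term-weighting stage can be sketched as follows. This is an illustration under my own assumptions, not the system's actual code: the bonus value of 1.5 is invented, and only the title bonus is shown; the other three conditions would be applied the same way as extra multipliers.

```python
import math
from collections import Counter

def tfidf_keywords(sentences, title_words, bonus=1.5):
    """Sketch of the term-weighting stage: TF-IDF computed over the
    article's sentences (each sentence treated as a mini-document),
    with an extra coefficient for terms that also occur in the title.
    The bonus of 1.5 is an illustrative assumption."""
    n = len(sentences)
    df = Counter()                      # in how many sentences a term occurs
    for sent in sentences:
        df.update(set(sent))
    tf = Counter(w for sent in sentences for w in sent)
    total = sum(tf.values())
    weights = {}
    for term, freq in tf.items():
        weight = (freq / total) * math.log(n / df[term])
        if term in title_words:         # bonus: the term is in the title
            weight *= bonus
        weights[term] = weight
    return weights

demo_sents = [["cats", "sleep", "a", "lot"],
              ["dogs", "sleep", "less"],
              ["cats", "chase", "dogs"]]
w = tfidf_keywords(demo_sents, title_words={"cats"})
# "cats" outweighs "dogs": same frequency, but it also appears in the title
```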

Sentence weighting is carried out according to the symmetric summarization method.

A detailed description is given in: Yatsko V.A. Symmetric summarization: theoretical foundations and methods // Nauchno-tekhnicheskaya informatsiya (Scientific and Technical Information). Ser. 2. 2002. No. 5.

In this approach, the weight of a sentence is defined as the number of links between the sentence and the sentences to its left and right. Links are keywords shared by the sentence and its neighbours. The sum of the left and right links is the sentence's weight. There is one limitation: the text must consist of at least three sentences.
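The scheme just described fits in a few lines of Python (again an illustration, not the project's code):

```python
def sentence_weights(sentences, keywords):
    """Symmetric summarization sketch: a sentence's weight is the number
    of links (shared keywords) with its left and right neighbours."""
    assert len(sentences) >= 3, "the method needs at least three sentences"
    kw = [set(s) & keywords for s in sentences]   # keywords per sentence
    weights = []
    for i in range(len(kw)):
        left = len(kw[i] & kw[i - 1]) if i > 0 else 0
        right = len(kw[i] & kw[i + 1]) if i < len(kw) - 1 else 0
        weights.append(left + right)
    return weights

print(sentence_weights([["cats", "sleep"], ["cats", "dogs"], ["dogs", "run"]],
                       {"cats", "dogs"}))   # [1, 2, 1]
```

The middle sentence links to both neighbours, so it gets the highest weight.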

In addition, when calculating a sentence's final weight, its position in the text (in news texts the first sentence is the most informative), the presence of proper nouns and numeric sequences, and the sentence's length are taken into account; a penalty factor reduces the weight of overly long sentences.

The specified number of significant sentences is then taken from the list sorted in descending order of weight, and the extracted sentences are arranged in the order in which they appeared in the original, to preserve at least some coherence. The default summary size is 20% of the original.
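The selection step, under the same assumptions as the sketches above, might look like this (the function name and its ratio parameter are mine, not the project's):

```python
def extract_summary(sentences, weights, ratio=0.2):
    """Keep the top-weighted sentences (20% of the text by default) and
    return them in their original order to preserve some coherence."""
    k = max(1, round(len(sentences) * ratio))
    top = sorted(range(len(sentences)),
                 key=lambda i: weights[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

print(extract_summary(["a", "b", "c", "d", "e"],
                      [1, 5, 2, 4, 3], ratio=0.4))  # ['b', 'd']
```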

How good are the summaries?


The traditional approach to evaluating summary quality is comparison with a human-written abstract. The ROUGE package is by far the most popular tool for this kind of evaluation.

Unfortunately, obtaining reference summaries is not easy, although the DUC conference, for example, provides the results of past summarization competitions, including human-written abstracts, if you go through a number of bureaucratic procedures.

I chose two fully automatic evaluation metrics, justified and described in section 3 here (pdf), which compare the summary with the original article: cosine similarity and the Jensen–Shannon divergence.

The Jensen–Shannon divergence shows how much information is lost if the original is replaced with the summary. Accordingly, the closer the value is to zero, the better the quality.

The cosine coefficient, a classic in IR, shows how close the document vectors are to each other. I used tf-idf weights of words as vector components. Accordingly, the closer the value is to 1, the better the summary matches the original in terms of keyword density.
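Both metrics are easy to compute over bag-of-words counts with the standard library alone. This is a generic sketch, not the evaluation code actually used (plain term frequencies instead of tf-idf, natural logarithm for the divergence):

```python
import math
from collections import Counter

def cosine(p, q):
    """Cosine similarity between two bag-of-words vectors (Counters)."""
    dot = sum(p[w] * q[w] for w in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def jensen_shannon(p, q):
    """Jensen–Shannon divergence between two word distributions:
    0 means identical, ln(2) is the maximum (natural log)."""
    vocab = set(p) | set(q)
    sp, sq = sum(p.values()), sum(q.values())
    m = {w: 0.5 * (p[w] / sp + q[w] / sq) for w in vocab}
    def kl(a, sa):
        return sum((a[w] / sa) * math.log((a[w] / sa) / m[w])
                   for w in vocab if a[w])
    return 0.5 * kl(p, sp) + 0.5 * kl(q, sq)

doc = Counter("the cat sat on the mat".split())
summ = Counter("the cat sat".split())
# identical texts: cosine 1.0, divergence 0.0; the summary sits in between
```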

The following systems were selected for comparison:

  • Open Text Summarizer, which works with all three of my chosen languages and of which, according to its developers, "Several academic publications have benchmarked it and praised it";
  • TextRank, a popular algorithm today, in the Sumy implementation;
  • Random — it is only fair to compare the algorithm against randomly selected sentences of the article.

For each language, uninterested users selected 5 texts from the areas of popular science (popsci), environment (environ), politics (politics), social issues (social), and information technology (IT). Summaries of 20% length were evaluated.

Table 1. English (Cos. = cosine similarity, closer to 1 is better; JSD = Jensen–Shannon divergence, closer to 0 is better):

| Topic | Cos. t-CONSP | Cos. OTS | Cos. TextRank | Cos. Random | JSD t-CONSP | JSD OTS | JSD TextRank | JSD Random |
|---|---|---|---|---|---|---|---|---|
| popsci | 0.7981 | 0.7727 | 0.8227 | 0.5147 | 0.5253 | 0.4254 | 0.3607 | 0.4983 |
| environ | 0.9342 | 0.9331 | 0.9402 | 0.7683 | 0.3742 | 0.3741 | 0.294 | 0.4767 |
| politics | 0.9574 | 0.9274 | 0.9394 | 0.5805 | 0.4325 | 0.4171 | 0.4125 | 0.5329 |
| social | 0.7346 | 0.6381 | 0.5575 | 0.1962 | 0.3754 | 0.4286 | 0.5516 | 0.8643 |
| IT | 0.8772 | 0.8761 | 0.9218 | 0.6957 | 0.3539 | 0.3425 | 0.3383 | 0.5285 |

Table 2. German:

| Topic | Cos. t-CONSP | Cos. OTS | Cos. TextRank | Cos. Random | JSD t-CONSP | JSD OTS | JSD TextRank | JSD Random |
|---|---|---|---|---|---|---|---|---|
| popsci | 0.6707 | 0.6581 | 0.6699 | 0.4949 | 0.5009 | 0.461 | 0.4535 | 0.5061 |
| environ | 0.7148 | 0.6749 | 0.7512 | 0.2258 | 0.4218 | 0.4817 | 0.4028 | 0.6401 |
| politics | 0.7392 | 0.6279 | 0.6915 | 0.4971 | 0.4435 | 0.4602 | 0.4103 | 0.499 |
| social | 0.638 | 0.5015 | 0.5696 | 0.6046 | 0.4687 | 0.4881 | 0.456 | 0.444 |
| IT | 0.4858 | 0.5265 | 0.6631 | 0.4391 | 0.5146 | 0.537 | 0.4269 | 0.485 |

Table 3. Russian:

| Topic | Cos. t-CONSP | Cos. OTS | Cos. TextRank | Cos. Random | JSD t-CONSP | JSD OTS | JSD TextRank | JSD Random |
|---|---|---|---|---|---|---|---|---|
| popsci | 0.6005 | 0.5223 | 0.5487 | 0.4789 | 0.4681 | 0.513 | 0.5144 | 0.5967 |
| environ | 0.8745 | 0.8100 | 0.8175 | 0.7911 | 0.382 | 0.4301 | 0.4015 | 0.459 |
| politics | 0.5917 | 0.5056 | 0.5428 | 0.4964 | 0.4164 | 0.4563 | 0.4661 | 0.477 |
| social | 0.6729 | 0.6239 | 0.5337 | 0.6025 | 0.3946 | 0.4555 | 0.4821 | 0.4765 |
| IT | 0.84 | 0.7982 | 0.8038 | 0.7185 | 0.5087 | 0.4461 | 0.4136 | 0.4926 |

The original articles and the resulting summaries can be viewed here.

Here and here you can download third-party packages for automatic evaluation of summary quality.

What's next?


Next, I plan to gradually improve the algorithm: take synonyms into account when searching for article keywords, or use something like Latent Dirichlet Allocation for that purpose; decide which parts of the text deserve special weighting (e.g. numbered lists); try to add more languages; and so on.

On the site itself: add the quality indicators to the statistics, add a visual comparison of the results of the "native" algorithm with third-party ones, etc.

Thanks for your attention!
