Automatic generation of meaningful unique texts

    Every SEO specialist knows that for search engines to favor a site, it must contain unique texts. And not just arbitrary sets of words, but meaningful sentences, preferably on the topic of the site. This is a particular problem for aggregators, which take information from other sites, and for online stores, where product parameters and data are largely the same everywhere. The standard practice in this situation is therefore to order unique texts from copywriters. This costs from 50 to 300 rubles per 1,000 characters. If your site has 10,000 pages, unique texts quickly become a significant expense.

    In this article we discuss algorithmic text generation methods and share our experience of working with them.

    Let us clarify right away that we focus on generating meaningful and useful texts rather than text-like garbage, which is easy to produce in huge quantities. It is often said that this task cannot be solved automatically, but in practice this belief is already outdated.

    As our task, consider the automatic generation of product descriptions based on reviews. That is, given several user reviews of a product collected from different sites, we automatically create a short unique text that summarizes the information from the reviews. This task is harder than, say, generating text from product specifications, because we must first extract information from the reviews and only then create a new text based on it.

    Suppose we work with phone reviews. What information can we extract? At a superficial level, we can determine whether a review is positive or negative using a text classifier, and then extract a list of the phone aspects it mentions. The simplest way is to look for occurrences of dictionary words such as “convenience”, “screen”, “battery”, “volume”, and so on. A more accurate way to identify aspects and their ratings is to use a trained system for extracting information from text.
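
    The dictionary approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the word lists, the clause-splitting rule, and the function name extract_aspects are hypothetical, and the polarity heuristic (counting positive and negative cue words in the same clause) stands in for the classifier mentioned above.

```python
import re

# Hypothetical dictionaries; a real system would use far larger ones.
ASPECT_WORDS = {
    "convenience": {"convenient", "convenience", "comfortable"},
    "screen": {"screen", "display"},
    "battery": {"battery"},
    "volume": {"volume", "sound"},
}
POSITIVE = {"good", "great", "excellent", "convenient", "bright"}
NEGATIVE = {"bad", "poor", "weak", "dim"}

def extract_aspects(reviews):
    """Return a dict like {"screen": "+", "volume": "-"} from raw review texts."""
    scores = {}
    for review in reviews:
        # Split into rough clauses so that a cue word is tied to a nearby aspect.
        for clause in re.split(r"[.!?,;]+", review.lower()):
            words = set(re.findall(r"[a-z]+", clause))
            polarity = len(words & POSITIVE) - len(words & NEGATIVE)
            for aspect, keywords in ASPECT_WORDS.items():
                if words & keywords:
                    scores[aspect] = scores.get(aspect, 0) + polarity
    # Keep only aspects with a clear overall sign.
    return {a: "+" if s > 0 else "-" for a, s in scores.items() if s != 0}

reviews = [
    "The screen is great. The volume is poor.",
    "Very convenient phone, but the battery is weak.",
]
aspects = extract_aspects(reviews)
```

    Aspects whose positive and negative mentions cancel out are dropped, which keeps the resulting summary conservative.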

    Thus, we obtain data of the form {convenience: +, volume: -, screen: +, ...}. It is not much information, but it will do for a start. Now we need to create the text. Let's see how this can be done.

    Templates. The first thing that comes to mind is to use templates. That is, prepare sentences in advance, such as “This phone is very convenient”, “The volume is good”, and so on. Then go through the list of aspects and insert the matching sentences. For our example, we get something like this:

    This phone is very convenient. The volume is poor. The screen is pretty good.

    The text is relatively meaningful and more or less readable, but it quickly ceases to be unique, since the variety of options is small. This is bad for search engines, and readers will eventually find it annoying.
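
    A minimal sketch of the template approach might look like this. The template table and the render function are assumptions made for illustration; only the three sentences from the example above come from the text.

```python
# Hypothetical template table: one fixed sentence per (aspect, rating) pair.
TEMPLATES = {
    ("convenience", "+"): "This phone is very convenient.",
    ("convenience", "-"): "This phone is not very convenient.",
    ("volume", "+"): "The volume is good.",
    ("volume", "-"): "The volume is poor.",
    ("screen", "+"): "The screen is pretty good.",
    ("screen", "-"): "The screen is disappointing.",
}

def render(aspects):
    """Join the template sentences for every known (aspect, rating) pair."""
    sentences = [
        TEMPLATES[(name, sign)]
        for name, sign in aspects.items()
        if (name, sign) in TEMPLATES
    ]
    return " ".join(sentences)

text = render({"convenience": "+", "volume": "-", "screen": "+"})
# text == "This phone is very convenient. The volume is poor. The screen is pretty good."
```

    With one sentence per pair, two products with the same ratings always receive identical text, which is exactly the uniqueness problem described above.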

    Formal grammar. Imagine this set of rules:

    $convenience ← $phone $conv
    $phone ← $this $phone-ex
    $conv ← is $mod $conv-ex
    $mod ← very
    $mod ← quite
    $mod ←
    $phone-ex ← phone
    $phone-ex ← device
    $this ← this
    $this ← the
    $conv-ex ← convenient $use
    $conv-ex ← convenient
    $use ← to use
    $use ←

    Start with the topmost rule and successively replace each symbol with one of its right-hand sides: $convenience => $phone $conv => $this $phone-ex is $mod $conv-ex => this device is quite convenient

    If the rule for each substitution is chosen at random, you get different sentences. For example, the same set of rules can generate “the phone is very convenient” and “this device is very convenient to use”.

    This set of rules describes many different sentence variants and provides significantly greater variability. With some diligence, you can write rules that generate varied and quite readable texts.
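
    Random expansion of such a grammar takes only a few lines. The sketch below is an illustration, not the generator used on the site: the GRAMMAR dictionary mirrors the rules above, written so that the output reads as English (the copula “is” and the article “the” are included for that reason), and the function names are hypothetical.

```python
import random

# Each nonterminal maps to its alternative right-hand sides.
# An empty list is an empty production (the symbol expands to nothing).
GRAMMAR = {
    "$convenience": [["$phone", "$conv"]],
    "$phone": [["$this", "$phone-ex"]],
    "$conv": [["is", "$mod", "$conv-ex"]],
    "$mod": [["very"], ["quite"], []],
    "$phone-ex": [["phone"], ["device"]],
    "$this": [["this"], ["the"]],
    "$conv-ex": [["convenient", "$use"], ["convenient"]],
    "$use": [["to", "use"], []],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal word, emitted as-is
    production = rng.choice(GRAMMAR[symbol])  # pick one alternative at random
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words

def generate(start="$convenience", rng=None):
    """Generate one sentence starting from the given nonterminal."""
    if rng is None:
        rng = random.Random()
    return " ".join(expand(start, rng))

rng = random.Random(0)  # fixed seed only to make runs repeatable
samples = [generate(rng=rng) for _ in range(5)]
```

    Even this tiny grammar yields 24 distinct sentences (2 × 2 × 3 × 2 choices), and every added alternative multiplies that number.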

    As an example, here is a phone description generated this way, taken from reviewdot.ru:

    We studied 295 reviews. There is reason to believe that this number is sufficient for an analysis. The vast majority of people are pleased with this phone, but there are some not-so-good opinions.

    Advantages: users who leave reviews, as a rule, highlight the design and sufficient usability among the advantages. In addition, users whose reviews were found are generally satisfied with the quality of the battery, volume, sound, camera, keyboard, case, plastic, strength, and screen.
    Weaknesses: reliability is commonly mentioned as a weak point.


    The disadvantages of this method are its limited vocabulary and the considerable labor involved (creating rules takes time and effort).

    For English, there are many ready-made language generation packages that include both a rule-based sentence planning system and the generation itself. One example is SimpleNLG, and there is a host of others, from simple to very advanced. The situation with Russian is somewhat worse, but as we have seen, writing a simple generator based on a formal grammar is relatively easy, and it can do quite a lot.

    Neural networks. Our latest development is a text-generating neural network. A paper about it was recently published in the proceedings of the Dialogue 2015 conference (the paper is available in English). This system learns to generate new texts from examples.

    It works on a principle similar to the one we described in our article about the chatbot. The difference is an additional layer of neurons that simultaneously receives information about the current word of the sentence and about the set of aspects included in that sentence. The list of aspects is encoded as a vector in which each dimension corresponds to one aspect, and the value of that dimension (1 or 0) encodes the presence or absence of the aspect in the sentence. The task of the neural network is to predict the next word, given the current word and the aspect vector.

    [Figure: diagram from our article, with captions translated into Russian]
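
    To make the input encoding described above concrete, here is a minimal sketch of how training examples could be prepared. It covers only the data side, not the network itself. The aspect list, the toy vocabulary, the start and end markers, and the choice to concatenate a one-hot word vector with the aspect vector are all assumptions made for illustration; the paper may combine the two inputs differently.

```python
# Hypothetical aspect inventory and toy vocabulary.
ASPECTS = ["convenience", "screen", "battery", "volume"]
VOCAB = ["<s>", "screen", "is", "bright", "battery", "weak", "</s>"]

def aspect_vector(present):
    """One dimension per aspect: 1 if the aspect occurs in the sentence, else 0."""
    return [1 if a in present else 0 for a in ASPECTS]

def one_hot(word):
    """One-hot encoding of a word over the toy vocabulary."""
    return [1 if w == word else 0 for w in VOCAB]

def training_pairs(sentence, present):
    """Build (input_vector, next_word) pairs for next-word prediction."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    aspects = aspect_vector(present)
    # The same aspect vector accompanies every word of the sentence.
    return [
        (one_hot(current) + aspects, following)
        for current, following in zip(tokens, tokens[1:])
    ]

pairs = training_pairs("screen is bright", {"screen"})
# pairs[0] pairs the encoded start marker plus aspect bits with the target "screen".
```

    At generation time the same aspect vector would be held fixed while words are sampled one by one, which is how a list of aspects steers the content of the sentence.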



    A trained neural network, given a list of aspects as input, is able to generate new sentences. Here are examples of the resulting texts:

    Convenient, player, battery. Comfortable sound, metal case. Small price, and easy to use. Screen, 2 sim cards, 2 battery.

    Great weight, large screen, attractive, good camera. The design sits + a good super camera, almost all games go (some ask for RAM) big.

    Battery, beautiful speed. Design, sound, functionality, a lot of different days are enough. He is handsome, volume back, processor, responsive sensor. Beautiful screen, color reproduction. Design, battery, do not brake, practical.

    Build quality, user-friendly interface. Great amoled display, buttons, camera, and all games. design, fast internet, catches the net well. Bright size, pleasantly heavy, fits well in the hand. Beautiful screen, speed, internet, java. Rich battery, fast functionality, reliability. strong, expensive percent, excellent sound, fast percent

    Has a USB flash drive, Java application, a card reader that is not a brick, quite tiny, the speaker is not buggy. Housing quality especially large buttons, good equipment.

    The main drawback is that some texts are clumsy and contain grammatical and semantic errors. The advantages are diversity, a more natural feel, and no need to develop rules by hand. One possible application is to generate many texts and then correct the awkward spots manually, which is still much faster than writing from scratch, especially if the texts are meant to be based on an analysis of real reviews.

    And of course, the model is not limited to the domain of reviews: in principle, it can be trained on any texts.

    In conclusion, I would like to quote a short fragment of Pierre Boulle's science fiction story “The Perfect Robot”, 1953:

    “If the noun ‘ram’ is chosen, the robot will be able to combine this word grammatically with a suitable adjective; in other words, to choose the right one from such phrases as ‘liquid ram’, ‘foggy ram’ or ‘white ram’, excluding those that violate the rules of agreement in grammatical gender and number, such as, for example, ‘radiant sheep’ or ‘white sheep’.
    - ‘Liquid ram’ is a meaningless phrase, - interrupted Professor Spirit of Contradiction.
    - Let me finish! Everything in due time... We do not foresee any particular complications at the next stage: the formation of a complete phrase according to the rules of syntax. These rules are precisely defined, so the machine will be able to absorb them just as the human brain does, and maybe even better. So we will achieve the formation of a certain number of grammatically correct phrases, such as ‘a liquid ram flies in a pointed sky’ or ‘a white ram eats grass’...
    - That's where I caught you! - the Spirit of Contradiction rejoiced. - Most of your phrases, grammatically correct as you say, will be meaningless!
    - They will be perfect in terms of form...”

    Phrases like “a card reader that is not a brick, quite tiny” invariably remind me of “a liquid ram flies in a pointed sky”, but overall we can say that after half a century, the task of automatically creating texts has moved from the realm of science fiction to the field of practical applications.
