Practical Aspects of Automatically Generating Unique Texts for SEO

    The scariest horror story for anyone who wants to put computer-written content on their site is search engine sanctions. At one time we, too, were frightened by claims that a site with non-unique and/or generated texts would be poorly indexed, or banned outright. At the same time, no one could tell us the exact requirements for the texts. On the whole, the topic of unique content and its role in website promotion resembles occult knowledge: each successive “specialist” promises to reveal the terrible truth on his page, yet the truth is never revealed, and the gist of many forum discussions is that Yandex, shall we say, recognizes generated content by magic. Not in those words, but that is the point.

    Since customers have recently been approaching us with the task of writing descriptions for products on their sites, we decided to study the question in more detail. What algorithms exist for detecting automatically written texts, what properties must a text have to avoid being flagged as web spam, and what tools can generate such a text?

    In recent years, unique text (and text in general) has become a standard tool that SEO experts recommend for promoting sites in search engines. Also in recent years, site owners have realized that ordering texts from people is rather expensive: prices for author-written texts have always run around $1-$3 per 1000 characters. Clearly, the owner of an online store, even with a modest assortment of 3-4 thousand items, would have to pay on the order of 300,000 rubles for texts, and this is not a one-time expense, since the assortment tends to be updated. Naturally, automatically generated product descriptions began to appear on site pages.

    How the search engine actually recognizes automatically generated content ...
    ... of course, we don't know. But the general principle of such methods is no secret, and by turning to the primary sources one can draw reasonable conclusions about the boundaries of the possible. To begin with, there is a paper on the site of Yandex's scientific publications with the promising title “Search for Unnatural Texts” [1]. It says something like: “in an unnatural text the distribution of [word] pairs is violated ... the number of rare pairs uncharacteristic of the language is overestimated compared to the standard, and the number of frequent pairs is underestimated”. Before us, then, is the first group of methods: one way or another, the statistical parameters of a given text are compared against the parameters of “natural” texts. Besides the distribution of pairs, frequencies of longer n-grams can also be used.
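    To make the idea concrete, here is a minimal Python sketch of this first group of methods: count the share of word pairs in a candidate text that are rare or absent in a reference corpus. The function names, the cutoff value and the tiny reference table are our illustrative assumptions, not the actual algorithm from [1].

        from typing import Dict, List, Tuple

        def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
            # All adjacent word pairs of the text.
            return list(zip(tokens, tokens[1:]))

        def unnaturalness_score(tokens: List[str],
                                reference_freq: Dict[Tuple[str, str], float],
                                rare_cutoff: float = 1e-7) -> float:
            # Share of pairs that are rare or unseen in the reference corpus.
            # Natural text reuses frequent collocations; generated or
            # synonymized text shows an inflated share of improbable pairs.
            pairs = bigrams(tokens)
            if not pairs:
                return 0.0
            rare = sum(1 for p in pairs if reference_freq.get(p, 0.0) < rare_cutoff)
            return rare / len(pairs)

        # Usage: flag a text whose score deviates strongly from the average
        # over known "natural" texts.
        reference = {("washing", "machine"): 3.2e-5, ("machine", "for"): 1.1e-5}
        print(unnaturalness_score("buy washing machine for home".split(), reference))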

    It is clear that the most primitive descriptions, generated by substituting product parameters into a template text, pass this filter: the original template was written by a person and therefore has natural statistical characteristics. That is, of course, provided the template correctly handles agreement of genders and cases, so that nothing ungrammatical like “Buy washing machine for 10,399 rubles” comes out.
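    For illustration, template-based generation is essentially string substitution plus agreement handling. Below is a toy sketch; the product fields, the template and the simplistic plural rule are invented for this example and are not taken from any real generator.

        PRODUCT = {"name": "washing machine", "price": 10399, "count": 3}

        def plural(noun: str, n: int) -> str:
            # Toy English stand-in for gender/case agreement: the noun form
            # must agree with the numeral (in Russian this involves cases).
            return noun if n == 1 else noun + "s"

        TEMPLATE = "Buy a {name} for {price} rubles - {count} {units} in stock."

        print(TEMPLATE.format(units=plural(PRODUCT["name"], PRODUCT["count"]),
                              **PRODUCT))
        # -> Buy a washing machine for 10399 rubles - 3 washing machines in stock.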

    Generators based on modern language models, such as neural network language models, are also very likely to pass this filter, since the general rule is: to catch text generated by some language model, you need an even better language model. A better model may simply not be available, and it would also require enormous computational resources, so using it to detect automatic texts across the whole Internet would simply be irrational. But generators based on a language model, applied directly, produce texts that are meaningless, for example: “The reliability of water heaters ‘Ariston’ wins the rating of boilers.”
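    This failure mode is easy to reproduce. Below is a tiny word-level Markov chain, a crude stand-in for a real language model (the “corpus” is a few invented phrases): it produces locally fluent but globally meaningless text.

        import random
        from collections import defaultdict

        corpus = ("the reliability of water heaters is high "
                  "the rating of boilers is published yearly "
                  "ariston wins the rating of water heaters").split()

        model = defaultdict(list)
        for prev, nxt in zip(corpus, corpus[1:]):
            model[prev].append(nxt)          # successors seen after each word

        random.seed(1)
        word, out = "the", ["the"]
        for _ in range(10):
            word = random.choice(model.get(word, corpus))  # fall back at dead ends
            out.append(word)
        print(" ".join(out))                 # fluent-looking word salad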

    Since owners of online stores generally do not want water heaters to win boiler ratings, they prefer simple template texts. But here, too, there is a potential difficulty.

    A template text is indistinguishable from a natural one as long as it exists in a single copy. Replicated many times, template texts become the subject of the second class of methods for detecting machine texts. The essence of these methods is that all texts written from one template are similar to each other everywhere except the spots where the parameters of a particular product are inserted. The result is what the English-language literature calls “near duplicates”. Search engines are able to detect them [3] using the well-known shingles method and its advanced variants. If a synonymizer is added on top, the number of improbable language constructs grows and the text becomes recognizable to the first group of algorithms [1]. Moreover, there are algorithms aimed specifically at synonymizers: they remove from the text all the words that the synonymizer could have replaced and compare the remaining invariant parts of the documents [4].
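    For illustration, here is a sketch of the shingles idea from [3]: split each text into overlapping word w-grams and compare the sets with the Jaccard coefficient. The direct set comparison shows only the principle; production systems hash the shingles (e.g. minhash or simhash) to work at web scale. The sample texts are invented.

        from typing import Set, Tuple

        def shingles(text: str, w: int = 4) -> Set[Tuple[str, ...]]:
            # All overlapping w-word windows of the text.
            tokens = text.lower().split()
            return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

        def jaccard(a: set, b: set) -> float:
            # Similarity of the two shingle sets; near duplicates score close to 1.
            return len(a & b) / len(a | b) if a | b else 0.0

        t1 = "buy the alpha kettle for 1000 rubles with free shipping today"
        t2 = "buy the beta kettle for 2000 rubles with free shipping today"
        print(round(jaccard(shingles(t1), shingles(t2)), 2))  # 0.14 here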

    Thus, the algorithms for recognizing machine-generated texts, while rather complicated, contain no magic or superintelligence. If desired, they can be reproduced for text-testing purposes; this is time-consuming, but on the whole not difficult.

    Philosophical digression
    We have encountered people who consider machine texts an evil that clogs up the Internet and exists to deceive users. But in our view this hardly applies to meaningful texts that describe specific products by their parameters. After all, such texts contain essentially correct information about the product. By placing such a text on a page, we declare its content to the search engine, so there is no deception of either search engines or buyers.

    Practice: How good are machine texts?
    In view of the foregoing, we settled on a hybrid method of text generation. First, the basic frame of the text is generated by a manually written grammar (see the previous article for details), and then a neural network analyzer is applied on top, trained to find places where words of certain classes can be inserted or deleted without loss of meaning. The need to write the generating grammar by hand does, of course, increase the cost of the solution, but it still remains an order of magnitude cheaper than ordering texts from a copywriter. Now, about quality.
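    To give a flavor of the approach, here is a very small sketch of what a generating grammar might look like; the rules, nonterminals and product fields are invented for illustration, and neither our real grammar nor the neural insertion stage is reproduced here.

        import random

        GRAMMAR = {
            "TEXT":    [["INTRO", "FEATURE", "OUTRO"]],
            "INTRO":   [["{name} costs only {price} rubles."]],
            "FEATURE": [["The {coating} coating keeps the product looking good."],
                        ["{mount} mounting makes installation easy."]],
            "OUTRO":   [["Free shipping."], ["Delivery across the country."]],
        }

        def expand(symbol: str, params: dict) -> str:
            if symbol not in GRAMMAR:        # terminal: substitute product fields
                return symbol.format(**params)
            production = random.choice(GRAMMAR[symbol])
            return " ".join(expand(s, params) for s in production)

        print(expand("TEXT", {"name": "Grohe Allure 19386000", "price": 5800,
                              "coating": "StarLight", "mount": "Flush"}))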
    Readability:

    "Grohe Allure basin mixer 19386000 from the new Allure collection, costing only 5800 rubles. Flush mounting provides enhanced ease of use and, of course, installation. The GROHE SilkMove system allows for extremely easy lever movement. The special coating produced by StarLight technology creates durability and maintains a good appearance of the product for many years. Vertical installation with two mounting holes is very convenient and should not cause difficulties. The spill discharge here is 220 mm. A larger offset size makes it much easier to use the product. The whole product has a total weight of 1.955 kg. The minimum pressure for this model is 1 bar. There is no need to connect to electricity. Free shipping and reliable, proven over the years,".

    Of course, this is no great work of literature, but there are no obvious flaws, and determining that the text was generated automatically is difficult even for a human.
    Uniqueness:
    a) Global uniqueness. Global uniqueness means that the text is unique relative to all other texts available on the Internet at the time of publication.

    To test global uniqueness, we used the well-known text.ru service (for objectivity, this article presents analysis results from third-party services rather than the output of our own algorithms).



    As you can see, there are no problems with global uniqueness. The service complains about spelling, but on inspection the “errors” turn out to involve the words “Allure”, “StarLight” and other specific terms the service does not know. Note: this data was collected before the texts were posted on the customer's website; now, of course, they can be found there.

    b) Local uniqueness. As we have already said, texts that are too similar to one another can be treated by the search engine as duplicates of each other, which would give away their artificial origin. To check local uniqueness, we used the service available on the backlinkmanager website (other comparisons based on the shingle algorithm give similar results).



    Two texts about very similar models with matching parameters turn out to be only 5% similar, and even that similarity is largely due to the mention of the product name “Grohe Alira sink mixer”. We consider this a good result, because there are not many ways to describe the same set of product parameters differently.

    Search Engine Indexing
    We had checked the indexing of machine-generated texts earlier, on the example of the site reviewdot.ru. The pages of this site had no unique content, so at first the site did not want to get into the Yandex index (of more than a hundred thousand pages, only about 1300 were indexed). We fought this persistently, first posting template texts (the number of indexed pages grew to 5000), then using more complex generation algorithms like the one discussed above. Today the Yandex index contains about 70,000 of its pages. Whether it was our efforts or changes in Yandex's algorithms that made the difference, we do not know. Nevertheless, the fact remains: pages containing automatically generated texts successfully make it into search engine indexes. And despite all the fears of SEO experts, no monsters appeared: the site did not fall under search engine sanctions, although there were theoretical grounds for that.



    Moreover, the index contains not just the pages but the automatically generated texts themselves, which is easy to verify by entering fragments of these texts into the search bar:


    So machine-generated content can, at the very least, be used to make a page relevant to certain queries.

    Of course, it should be noted that we did not post meaningless texts, but texts containing information useful to the user (reviewdot analyzes product reviews left on different sites and gives the user a brief summary of the pros and cons people mention).

    We also compared the time users spend on pages with and without such texts. It turned out that the texts had a positive effect on the time a user spends on the page. Apparently the reason is simple: if a person sees on the page a coherent text containing the information he needs, he starts reading it, and reading takes time.

    Concluding remarks
    To date, the texts have been delivered to the customer and posted on the website (the online plumbing store g-online.ru); anyone interested can have a look at them. So far we can conclude that generated texts can be made quite similar to “natural” ones and, with the right approach, do not harm the site. Generated texts can improve the indexing of site pages and make pages relevant to specific queries, and the generator can be programmed to mention given keywords or phrases at a precisely specified percentage of the text size.
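    As a minimal sketch of that last point (the function and the sample text are illustrative, not our production code), the generator only needs a way to measure the share of text occupied by a key phrase and to regenerate until the target is met:

        def keyword_share(text: str, phrase: str) -> float:
            # Fraction of the text's characters covered by the key phrase.
            occurrences = text.lower().count(phrase.lower())
            return occurrences * len(phrase) / max(len(text), 1)

        text = "Grohe Allure basin mixer: buy the Grohe Allure mixer with delivery."
        print(round(keyword_share(text, "Grohe Allure"), 3))  # 0.358 here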

    Literature
    1. E.A. Grechnikov, G.G. Gusev, A.A. Kustarev, A.M. Raygorodsky. Search for unnatural texts // Proceedings of the 11th All-Russian Scientific Conference “Digital Libraries: Advanced Methods and Technologies, Electronic Collections” - RCDL'2009, Petrozavodsk, Russia, 2009.
    2. Aharoni, Roee, Moshe Koppel, and Yoav Goldberg. Automatic Detection of Machine Translated Text and Translation Quality Estimation // Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 289–295, Baltimore, Maryland, USA, June 23-25, 2014.
    3. G.S. Manku, A. Jain, A. Das Sarma. Detecting Near-Duplicates for Web Crawling // Proceedings of the 16th WWW Conference, May 2007.
    4. Q. Zhang, D.Y. Wang, G.M. Voelker. DSpin: Detecting Automatically Spun Content on the Web // NDSS, 2014.
