Robots in Journalism, or How to Use Artificial Intelligence to Create Content

    Machines are getting smarter. They already generate content of such quality that even a professional cannot always tell it from “human” writing. At our conference “Contenting”, Sergey Marin of Data Studio explained why journalists and editors should not fear the competition and discussed the prospects for automating journalism.



    Below the cut is a transcript of his talk.

    About the speaker
    Sergey Marin is an artificial intelligence expert and the founder and head of Data Studio.

    Three pillars of artificial intelligence


    When we talk about artificial intelligence, in journalism or in any other field, we first need to understand its structure. AI consists of three main components: machine learning, recommender systems, and neural networks. Many people treat neural networks as a synonym for artificial intelligence, but they are only one of the tools, and not even the most widely used one: in each case, we use whichever algorithms work best.



    Machine learning: sorting things onto shelves


    Machine learning is used to find hidden patterns in data. Imagine we have a set of news items or publications that need to be classified, that is, tagged automatically. Or simply texts with a lot of words that need to be sorted by class, interest, mood, and so on. How do we do this? With machine learning, we do not look for keywords and draw conclusions from them. Instead, we show the machine as many texts as possible that we have already labeled with a large number of classes. Then we give it a new text, and the machine itself assigns it to the class it belongs to. In other words, we first train it by showing many examples.



    So the main application of machine learning in journalism is classification. For example, we have a large number of news items, from the Internet, social networks, and news agencies, and we need to classify them quickly. We train our model in advance, and when a new item arrives, the machine works out where it belongs, what its topic is, what mood it conveys, and which audience it suits. The popularity, or rating, of a news item is predicted in a similar way.
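
    The train-then-classify workflow described above can be sketched in a few lines. The talk does not name a specific algorithm, so this sketch assumes a multinomial naive Bayes classifier; the function names, tags, and toy training texts are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Count words per tag from (text, tag) pairs that a human has already labeled."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    for text, tag in labeled_texts:
        tag_counts[tag] += 1
        word_counts[tag].update(text.lower().split())
    return word_counts, tag_counts


def classify(text, word_counts, tag_counts):
    """Return the most probable tag for a new text (naive Bayes with add-one smoothing)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in tag_counts:
        # Start from how common the tag is, then add evidence from each word.
        score = math.log(tag_counts[tag] / total_docs)
        total_words = sum(word_counts[tag].values())
        for word in text.lower().split():
            score += math.log((word_counts[tag][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Toy, invented training set; a real system needs far more labeled examples.
examples = [
    ("central bank raises interest rate", "finance"),
    ("shares fall after quarterly report", "finance"),
    ("team wins the championship final", "sport"),
    ("striker scores twice in the final match", "sport"),
]
model = train(examples)
predicted = classify("bank shares rise after rate decision", *model)
```

    The same pattern extends to mood or audience labels: only the tags in the training set change, not the code.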

    Recommender systems: find a personal approach


    The main application of recommender systems is personalization. We want to show content that is relevant at least to a certain segment and, ideally, selected for each person. In this respect, presenting content is no different from selling. Think of the leaders in targeted sales: online stores like Amazon and online cinemas know how to recommend their products. And if we treat content as a product, it turns out we already know how to recommend and target it well.



    How do we do this? There are two basic principles. The first is recommender systems that, in essence, compare people with each other based on their purchases or, in our case, on the content they have already consumed. A simple example: Igor and Peter have watched roughly the same films, so if one film has been seen only by Igor, it is logical to recommend it to Peter.
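
    The Igor and Peter example corresponds to what is commonly called collaborative filtering. Below is a minimal sketch of it; the Jaccard similarity measure, the function names, and the viewing histories are assumptions made for illustration, not details from the talk.

```python
def jaccard(a, b):
    """Share of films two viewers have in common, from 0 (nothing) to 1 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend(user, history):
    """Suggest films seen by the most similar other viewer but not yet by `user`."""
    seen = history[user]
    others = [name for name in history if name != user]
    nearest = max(others, key=lambda name: jaccard(seen, history[name]))
    return sorted(history[nearest] - seen)


# Invented viewing histories.
history = {
    "Igor": {"Alien", "Solaris", "Stalker", "Arrival"},
    "Peter": {"Alien", "Solaris", "Stalker"},
    "Anna": {"Amelie", "Notting Hill"},
}
for_peter = recommend("Peter", history)  # Igor is the closest match, so his extra film is suggested
```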

    The other principle is much more powerful for recommending content: assessing its popularity, as PageRank does. The earliest example is search, in Yandex or Google. How do we determine that a page is significant? We count the links or references to this page on other resources and get a kind of rating for it. But it is one thing when five unknown pages link to a publication, and quite another when the links come from popular brands or major news agencies. So we must also take into account the rating of those who link to our page, which gives us a hierarchy.

    Tinder works the same way: as you swipe left and right, a rating is calculated both for you and for the people you are shown. You are shown photos of people whose rating is about the same as yours; that is how the service's recommendations work.



    This is a very effective method for automatically assessing the significance of information. If you can count not only mentions but also their significance, you can automatically rank all news items for specific target audiences. That is why recommendations are used mainly for this kind of targeting.
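
    Counting "not only mentions, but also their significance" can be illustrated with a simplified PageRank computed by repeated iteration. This is a sketch under assumptions: the damping factor of 0.85, the iteration count, and the link graph are illustrative choices, not details from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages iteratively: a link counts for more when the linking page scores high itself."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Invented link graph: story_a has one link from a well-linked agency,
# story_b has two links from blogs that nobody links to.
links = {
    "agency": ["story_a"],
    "blog_1": ["story_b", "agency"],
    "blog_2": ["story_b", "agency"],
    "blog_3": ["agency"],
    "story_a": [],
    "story_b": [],
}
rank = pagerank(links)
```

    In this toy graph, story_a ends up ranked above story_b even though it has fewer incoming links, because its single link comes from a page that is itself highly rated.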

    Neural networks: imitation of the brain


    The concept of neural networks is simple, even banal. Until about the 1960s, research into how the human brain works painted the following picture: there is a set of neurons that receive input signals. Each neuron modifies the signal slightly and passes it on. To understand how these neurons group together within the brain, researchers decided to build a computer model: a set of neurons connected to one another in some way. That is how the first neural networks were born, and in this form they are still used to solve machine learning problems. But for anything more advanced, such a system is not enough.
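
    The "receive a signal, modify it slightly, pass it on" model can be written down directly. In this minimal sketch the sigmoid activation and all weights are arbitrary, untrained values picked for illustration.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weigh the incoming signals, sum them, squash the result into (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation


def layer(inputs, weight_rows, biases):
    """A group of neurons that all receive the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Arbitrary, untrained weights chosen only for illustration.
signals = [0.5, -1.0, 2.0]
hidden = layer(signals, [[0.1, 0.4, -0.2], [0.7, -0.3, 0.5]], [0.0, 0.1])
output = neuron(hidden, [1.0, -1.0], 0.0)  # the next neuron receives the modified signals
```

    Training such a network means adjusting the weights and biases so that the outputs match labeled examples; that step is omitted here.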



    Somewhere in the 1990s, scientists realized that the human brain does not work quite like that. Neurons do interact with each other, but everything is organized hierarchically. For example, when I look at a picture, information is collected from each of its regions and then aggregated into another, smaller group of neurons, where it is stored as a kind of internal representation. In fact, we think in these internal representations, not in the actual pictures we see. The theory was immediately reproduced in neural networks, and at image classification such networks now perform much better than humans. They are called convolutional networks, because this process of aggregation is taking place.
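
    Collecting information region by region and aggregating it into a smaller representation can be sketched as one convolution step followed by max pooling. The 5x5 "picture", the hand-picked edge kernel, and the pooling size are assumptions for illustration; real convolutional networks learn their kernels from data.

```python
def convolve(image, kernel):
    """Slide a small kernel over the image; each output value summarizes one local patch."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def max_pool(feature_map, size=2):
    """Keep only the strongest response in each size x size block, shrinking the map."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# A toy picture with a vertical edge between a dark (0) and a bright (1) area.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hand-picked kernel that responds to a dark-to-bright step from left to right.
edge_kernel = [[-1, 1], [-1, 1]]
features = convolve(image, edge_kernel)  # 4x4 map of local responses
summary = max_pool(features)             # 2x2 "internal representation"
```

    The 25 pixels are reduced to four numbers that still record where the edge is, which is the kind of compact internal representation the paragraph describes.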



    The second breakthrough came when researchers found that a person perceives information not in isolation but with a certain context in mind. To teach computers to take accumulated experience into account, they built so-called recurrent neural networks. These build on the work of the earlier neural networks, first to classify and then to create content. All of this is now used in sequence modeling or, put more simply, in chatbots. For example, when Yandex suggests similar words, that is a recurrent neural network replicating how a person processes information.
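
    The idea of perceiving information "taking into account a certain context" can be shown with a one-unit recurrent network: each step mixes the new input with a state remembered from previous steps. The weights below are arbitrary, untrained values assumed for illustration.

```python
import math


def rnn(sequence, w_in=0.8, w_state=0.5):
    """Run a one-unit recurrent network and return the state after every step."""
    state = 0.0
    states = []
    for x in sequence:
        # The new state depends on the current input and on what was remembered before.
        state = math.tanh(w_in * x + w_state * state)
        states.append(state)
    return states


# The same final input (1.0) produces different outputs depending on what preceded it.
after_quiet = rnn([0.0, 0.0, 1.0])[-1]
after_active = rnn([1.0, 1.0, 1.0])[-1]
```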

    How neural networks are used in journalism


    The first area of application for neural networks is content generation. Given a news item, a trained neural network can determine the topic and write a perfectly intelligible text. There are already companies that produce such software, and publications that use it for routine items such as stock market reports and company financial results. For factual information (an earthquake happened here, a ship sailed there, and so on) it works fine. But for more complex stories, serious work is needed to turn the content generated by the neural network into something truly meaningful and adequate.



    The second area is classification, already mentioned above. The third is perception assessment, or A/B testing, which is rarely used outside of sales. In journalism the principle is similar: we have several versions of a publication, and we want to test how each performs with different target groups. With such methods, this process can be fully automated.
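
    One common way to automate the comparison of two versions of a publication is a two-proportion z-test on click-through rates. The talk does not say which statistic is used, so the test, the function name, and the numbers below are assumptions for illustration.

```python
import math


def ab_test(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test: is the difference in click-through rate likely to be real?"""
    rate_a, rate_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented numbers: headline A got 100 clicks from 1000 views, headline B got 150 from 1000.
z, p_value = ab_test(100, 1000, 150, 1000)
```

    A small p-value suggests the difference between the versions is unlikely to be chance; the threshold to act on (for example 0.05) is an editorial decision.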

    The last area will appeal to those who need to write the same content for different channels, resources, and target audiences. To publish on Habr an article that has already appeared elsewhere, you cannot simply copy and paste it. To adapt it, you can either hire a copywriter or use a neural network. For a computer, this is even simpler than machine translation: the text does not need to be converted to another language, another syntax, and so on. But in essence it is the same task.

    Where is it used? The pioneer among major agencies is the Associated Press. It uses automatic content generation for financial news, which contains little analysis but a lot of figures and facts. There are three vendors that make such software: Narrative Science, Automated Insights, and Article Forge. On their websites you can find many real cases, examples of publications written by robots. All of these articles are based on facts.
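
    To show why figure-heavy news lends itself to automation, here is a deliberately simple sketch that turns a few numbers into a sentence. It swaps in a fixed template instead of a neural network, and it is not how any of the vendors above is known to work; the function name, company, and figures are invented.

```python
def earnings_note(company, quarter, revenue, previous_revenue):
    """Turn a handful of figures into a one-sentence news item using a fixed template."""
    change = (revenue - previous_revenue) / previous_revenue * 100
    lead = f"{company} reported {quarter} revenue of ${revenue:,.0f}"
    if change == 0:
        return f"{lead}, unchanged from the previous quarter."
    verb = "rose" if change > 0 else "fell"
    return f"{lead}, which {verb} {abs(change):.1f}% from the previous quarter."


# Hypothetical company and numbers.
note = earnings_note("Example Corp", "Q2", 1_250_000, 1_000_000)
```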



    Is there a noticeable difference between human-written and generated content? Studies were conducted in the United States and Germany in which groups of journalists were shown a large number of articles, in English and in German respectively. Half of the texts were written by people and half by machines. On average, people could not tell them apart. And when the subjects were asked to rate the texts for credibility and interest, it turned out that they found the machine-written texts more credible. At the same time, respondents noted that they were less interesting to read than “human” articles.

    It turns out that people are better at entertaining content. And if you need to deliver news, use a machine: it will be believed more.

    Benefits and dangers


    Robots let you focus on the meaning you want to put into the content rather than on the tedious process of adapting it to different formats. Another advantage of machines is reaction speed: if you need to process news leads quickly, this is your tool. We have already mentioned personalization for users, which is a definite plus. The fourth advantage is crowdsourcing: if you draw on a large number of sources, the machine can automatically classify the information received from them, tell good from bad, and select the adequate ones.



    There are potential dangers too. The first is the echo chamber. The content I am shown is personalized based on my interests, taking into account what I have already read and the interests of people like me. So after a certain number of iterations, I end up stewing in my own closed information space.

    The second danger is information bubbles. If you invent a fictional situation or event, a machine can write many different versions of publications about it that all look authentic. With the help of bots, social networks, and so on, such misinformation can be spread to huge audiences.



    There is now much talk about so-called adversarial attacks on neural networks. The example usually given involves the KFC logo: show such a picture to a self-driving car and it immediately stops, because the artificial intelligence recognizes the image as a stop sign. If such manipulations are possible with texts, then a meaningless set of words constructed by a certain algorithm could get a high rating from neural networks, while the reader would see nothing but gibberish.



    Fortunately, in practice such an attack is very difficult. Recall that a neural network, like our brain, maps any image to an internal representation. Look at the picture: on the left are faces as we see them, and on the right, as the neural network sees them. With access to the neural network itself, you can pick suitable pictures, as in the KFC logo example. In fact, the problem is also familiar from cryptography, because it is analogous to breaking a hash function. The neural network here acts as a hash function: it converts a long text into a small internal representation. If you find something that matches, you have broken it. But to be able to iterate over candidates, you need access to the algorithm.
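
    The hash-function analogy can be made concrete with a deliberately weakened hash. In this sketch the first three hex characters of SHA-256 stand in for the "small internal representation"; this is an assumption for illustration and not a neural network. Note that the search works only because the attacker can call the function as often as needed.

```python
import hashlib
from itertools import count


def short_hash(text, length=3):
    """Compress any text into a short fingerprint (first hex characters of SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def find_collision(target_text, length=3):
    """Try meaningless candidate strings until one has the same fingerprint as the target."""
    target = short_hash(target_text, length)
    for i in count():
        candidate = f"gibberish-{i}"
        if short_hash(candidate, length) == target:
            return candidate, i + 1


original = "Robots in journalism"
fake, attempts = find_collision(original)
# `fake` is gibberish, yet its fingerprint matches that of `original`.
```

    With a three-character fingerprint there are only 4096 possible values, so a match turns up after a few thousand attempts on average; without access to `short_hash`, the attacker has nothing to iterate against.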

    Not a competitor, but an assistant


    Almost every publication on this subject raises the question of whether journalists will be in demand in the future. The question, it seems to me, is not quite the right one: some will be replaced and some will not, but it is clear that journalism as a whole cannot be handed over to machines. People will cede to them only basic, banal, simple publications. The problem lies elsewhere: since basic publications can be created automatically and easily, very soon the share of generated content will be much larger than the share written by people. As we have already seen, generated content is perceived as more credible, and that makes it a powerful tool for manipulating consciousness and perception. This is probably the most frightening and most important point.



    To create content with machine learning, people and machines work not separately but together, as a pair. First, the machine searches for news leads, classifies them, predicts their importance, generates content ... This applies when we have a large flow of all kinds of information and want to respond to it quickly. If you have time to think things over, that is a completely different scenario. The content prepared by the machine goes to a journalist or editor, who reviews, evaluates, and adds to it. The text can then go to publication or back to the robot, to produce different versions for different target audiences. After that, the machine handles personalization, choosing what to show each person. Of course, not all of this is implemented together everywhere,

    People are not excluded from the content preparation process. Robots are nothing more than additional tools that speed up and simplify the work and take routine tasks off our hands.



    Video recordings of the talks from “Contenting” can be ordered here. Habr users get a discount with the promo code habr_online_promo.
