GPT-2 neural network from OpenAI. Quick start

The news about Google's BERT neural network, which showed state-of-the-art results on a number of natural language processing (NLP) tasks, had barely stopped making noise when OpenAI rolled out a new development: GPT-2. This neural network, with a record number of parameters at the time of writing (1.5 billion, versus the 100–300 million usually used in such cases), was able to generate entire pages of coherent text.

It generates text so well that OpenAI declined to release the full version, fearing that the network would be used to create fake news, comments and reviews indistinguishable from real ones.

However, OpenAI did share a smaller version of GPT-2, with 117 million parameters. That is the one we will run through Google Colab and experiment with.

A bit of background

For those who have not been following the progress in natural language processing (NLP).

In the summer of 2018, OpenAI pre-trained the GPT neural network, built on the Transformer architecture, on a large amount of text. It turned out that if you replace a couple of the last layers and train the network further for a specific task (this approach is called fine-tuning and is widely used in machine learning), it beats previous records on a wide range of NLP tasks at once.

Based on this development, at the end of 2018 Google created its own neural network, BERT. They seriously improved the results by making the network bidirectional, unlike GPT.

Not wanting to give up, in February 2019 OpenAI scaled GPT up tenfold at once and trained it on an even larger amount of text: 8 million web pages (40 GB of text in total). The resulting GPT-2 network is currently the largest neural network, with an unprecedented 1.5 billion parameters (the largest BERT model has 340 million, and the standard BERT has 110 million).

As a result, GPT-2 was able to generate entire pages of coherent text, with repeated references to the names of characters in the course of the story, quotes, references to related events, and so on. I will not give examples here; those interested can find them in the original article on the OpenAI blog, Better Language Models and Their Implications, or via the links at the end of this article.

Generating coherent text of this quality is impressive in itself, but the most interesting part is something else. Without any additional training, GPT-2 immediately showed results close to state-of-the-art on a number of NLP tasks. I repeat, for those who missed the significance of this: without any further training for the specific task!

How did they do it? Simply by asking the neural network the right questions.

GPT-2 architecture

GPT-2 is trained to predict the next word in a sentence. This is the classic approach to generating text. At first, recurrent networks (RNNs), in particular LSTMs, led this area. But after the Transformer architecture was invented in the summer of 2017, it gradually came to dominate NLP tasks. The original Transformer has trouble remembering long sequences (LSTMs remember longer ones), but its training speed and network depth more than compensated for this. By the way, a number of modifications of the Transformer have already appeared, such as one that introduces recurrence (Universal Transformers) and one for longer sequences (Transformer-XL), but Google and OpenAI so far use only a slightly tuned original Transformer.
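To make the "predict the next word, then repeat" idea concrete, here is a minimal sketch of such a generation loop. The word-pair counting model is only a toy stand-in assumed for illustration (GPT-2 itself is a Transformer, not a lookup table), and the names train_bigram and generate are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count which word follows which: a toy stand-in for a trained language model."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, prompt, max_new_words=5):
    """Greedy loop: predict the most likely next word, append it, repeat."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the toy model never saw anything after this word
        next_word, _count = candidates.most_common(1)[0]
        words.append(next_word)
    return " ".join(words)

corpus = "the apple is on the table and the apple is on the plate"
model = train_bigram(corpus)
print(generate(model, "the apple", max_new_words=3))
```

The loop is the same whatever model sits inside it: only the step that scores the candidates for the next word changes.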

BERT from Google, I remind you, was trained a little differently: to predict not the next word in a sentence, but missing (masked) words in it, and also to determine whether two consecutive sentences are a logical continuation of each other or are unrelated in meaning. This made BERT a language model that understands the meaning of words depending on their surroundings (context), which determined its success in NLP tasks. But only after further training (fine-tuning) for a specific task: plain word prediction in the base model does not work very well. You can play with BERT in the browser (via Google Colab): https://habr.com/ru/post/436878 .

GPT-2 does not need additional training. It is not just a language model like BERT, it is a text generator. Just feed it the beginning of a phrase, and it will continue the phrase word by word.

An interesting detail: OpenAI's research showed that corpora of Wikipedia texts and literary books (on which BERT, in particular, was trained) have a biased style. Therefore, neural networks trained only on them do not generate text very well. To diversify the input data and styles, OpenAI trained GPT-2 on 8 million regular web pages (40 GB of text in total). And to discard advertising and spam sites, they included only pages whose links have a good rating on Reddit, that is, pages that real users found to contain some useful information.

The right question is half the answer

So GPT-2, thanks to its unprecedented size, was able to generate pages of coherent text. But the most surprising thing is that when asked the right question (that is, given the right beginning of a phrase), it was able to answer various questions! Simply because the answer is the most natural continuation of such a beginning.

For example, to get an answer to the question "What is Earth?", you can feed the network the beginning of the phrase: "Earth is ...". It will then complete the phrase, because the answer is a natural continuation of such a beginning.

Moreover, by forming the beginning of the phrase in the right way, you can get explanations for different target audiences, taking into account their intelligence, age and education. Imagine continuations of the phrases: "I, as a scientist, believe that the Earth is ...". Or: "I, as a landowner, claim that the Earth is ...". Or: "I, being a kindergarten teacher, will now explain to you, children, that the Earth is ...".

As you can see, by forming the right questions (the right beginning of a phrase), you can get answers at completely different levels and with different amounts of detail. In some ways, the same happens with people. A doctor must explain the course of a disease to a patient so that the patient understands it, that is, at the patient's level. If a five-year-old child is asked why he did something, he cannot answer right away (naturally, since children live by feelings and emotions). But to give the answer that is expected of him, the child begins to invent it, that is, to generate text. The aim is for the answer to satisfy the parent and to correspond at least somehow to what happened. At first, as many parents know, these will be absurd answers. But through encouragement and punishment (“tell me more”, “do not invent excuses”), the child learns to give detailed and complete answers.

This work by OpenAI, and the ability of the GPT-2 network to handle NLP tasks without special training for each specific task, raises two interesting questions:

1) Can the interpretability of neural networks be achieved with such an elementary text generator and the right beginning of a phrase, where the answer is a natural continuation? Suppose, for example, that a neural network reports cats in a photograph not as x and y coordinates, but describes their position in plain text. Then, by asking it the right follow-up question, for example: "I came to this conclusion because ...", you could theoretically get an explanation of how it found the cat in the photo. In the limiting case, this explanation may be no worse than a human one, which would solve the global problem of the interpretability of neural networks.

2) Can a neural network pre-trained on large volumes of text be universal, possess general common sense, and not require additional training for specific tasks? The idea is that, in trying to imitate human speech (human answers to questions), the network must inevitably learn common sense in order to give answers that closely resemble human ones. Monosyllabic throwaway answers are generally not typical of people. Most people give detailed answers, which means the network must learn to do the same.

Both of these questions remain open, but the first step toward answering them has definitely been taken.

More precisely?

If you are standing now, you had better sit down. Because here is how OpenAI, using the GPT-2 neural network, got its results on NLP tasks in different domains:

Answering questions about a text

Well, this one is simple. In one variant, the network was fed several paragraphs of text that include somewhere in the middle, for example, "the apple is on the table", and at the end the phrase "the apple is on ..." was appended, and the network added "the table". This works because it is able to remember the context of several paragraphs.

In the other variant, the network was given, as the initial text, several examples of the form "Question: some question, Answer: some answer", and at the end, after the real question, "Answer:" was added. And the neural network filled in the answer! It had picked up the structure of the document from the preceding Question-Answer pairs. It is amazing.
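The Question/Answer layout described above can be sketched as a small prompt builder. This is only an illustration: the function name build_qa_prompt, the sample questions and the one-item-per-line layout are assumptions, not the exact format OpenAI used.

```python
def build_qa_prompt(examples, question):
    """Lay out solved examples, then the real question with an open 'Answer:'."""
    lines = []
    for solved_question, solved_answer in examples:
        lines.append(f"Question: {solved_question}")
        lines.append(f"Answer: {solved_answer}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")  # left open so the model continues with the answer
    return "\n".join(lines)

# Made-up examples, used only to show the shape of the prompt.
examples = [
    ("What color is the sky?", "Blue"),
    ("How many legs does a cat have?", "Four"),
]
prompt = build_qa_prompt(examples, "What is the capital of France?")
print(prompt)
```

The resulting string is what you would type at the "Model prompt >>>" input; the text the model appends after the final "Answer:" is its answer.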

Text summarization

A long text of several paragraphs or even pages is fed to the input, and the neural network has to write a short summary. How was this behavior obtained from GPT-2? They simply added "TL;DR" after the text. And that is it! That was enough for GPT-2 to add a summary of the article after these characters, because on the Internet these characters often mark a brief summary of a post.
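As a rough sketch of this trick, the prompt is just the article with the marker appended, and the summary is whatever the model writes after the marker. The helper names, the exact marker spelling (with a trailing colon) and the sample continuation below are assumptions made for illustration.

```python
SUMMARY_MARKER = "TL;DR:"  # assumed spelling of the marker

def build_summary_prompt(article):
    """Append the marker so that a summary is the natural continuation."""
    return article.rstrip() + "\n" + SUMMARY_MARKER

def extract_summary(full_output):
    """Return the text after the last marker in prompt + continuation."""
    _before, _marker, after = full_output.rpartition(SUMMARY_MARKER)
    return after.strip()

article = "Long article text goes here. It may span many paragraphs.\n"
prompt = build_summary_prompt(article)

# Hypothetical continuation, standing in for what a model might return.
fake_output = prompt + " A short recap of the article."
print(extract_summary(fake_output))
```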

Text translation

GPT-2 was fed text of the form: "hello = bonjour, dog = chien, wind = vent, cat = ...". And the neural network added the translation of the last word: "chat" (OpenAI ran this with French). It picked up the structure of the document and simply completed it with the most logical continuation. If your jaw has not dropped yet, then I have two pieces of news for you, and both are bad =).
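The translation prompt described above can be sketched in the same spirit. The word pairs, the function names and the rule of cutting the continuation at the next comma are illustrative assumptions.

```python
def build_translation_prompt(pairs, word):
    """Join known 'source = target' pairs and leave the last target open."""
    shown = ", ".join(f"{source} = {target}" for source, target in pairs)
    return f"{shown}, {word} ="

def first_item(continuation):
    """Keep only the text up to the next comma or line break."""
    return continuation.replace("\n", ",").split(",")[0].strip()

# English-French pairs chosen only as an example.
pairs = [("hello", "bonjour"), ("dog", "chien"), ("wind", "vent")]
prompt = build_translation_prompt(pairs, "cat")
print(prompt)

# Hypothetical continuation, standing in for what a model might return.
print(first_item(" chat, house = maison"))
```

Cutting at the first comma matters because the model tends to keep extending the list with new pairs of its own.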

Running GPT-2 through Google Colab

Unfortunately, OpenAI declined to share the full version of GPT-2, on the grounds that with this neural network it would be too easy to generate fake news and fake store reviews. Judging by their statement, the discussion of whether it is appropriate to release this model will continue for the next 6 months, after which OpenAI will make a final decision on whether to release it. However, it would not be difficult for a large organization to reproduce the model (it appears to have been trained on 256 TPUs for several days, and by preliminary estimates this cost about $45,000).

However, they did post a smaller version of GPT-2, with 117 million parameters (rather than the 1.5 billion of the full model): https://github.com/openai/gpt-2 . Let's try to run this model and play with it.

The easiest way to do this is through Google Colab:

Open the link:

http://colab.research.google.com/github/blade1780/bert/blob/master/Gpt-2.ipynb

In the Runtime menu, select Run All, so that on the first run all cells execute, download the model and load the necessary libraries. Agree to reset all runtimes if prompted. When "Model prompt >>>" appears, type your text and press Enter.

If something went wrong ...

Make sure that GPU and Python 3 are selected in the Runtime -> Change runtime type menu.

If the Connect button does not show Connected, click it to connect.

Or enter all the code manually:

Go to https://colab.research.google.com
Click on the blue NEW PYTHON 3 NOTEBOOK button
In the menu Runtime -> Change runtime type, select Python 3 and GPU (the latter so that the neural network runs on the GPU)
In the first cell, enter:

!git clone https://github.com/openai/gpt-2
%cd gpt-2
!sh download_model.sh 117M
!pip3 install -r requirements.txt

Then click the black Play icon to the left of the cell. This downloads the GPT-2 neural network and installs the necessary dependencies.

In the second cell (you can add one through the menu Insert -> Code cell, or by hovering the mouse below the center of the current cell, where the add buttons pop up), enter:

!python3 src/interactive_conditional_samples.py

This launches interactive mode. Wait until the neural network has loaded and a text input box labeled "Model prompt >>>" appears below. Enter the beginning of a phrase and press Enter. After a while, the generated text will appear under the heading SAMPLE.

You can also generate completely random text. The text will be generated indefinitely in small chunks (SAMPLE 1, SAMPLE 2, and so on) until you press the Stop button on the cell. To do this, create a new cell with the code:

!python3 src/generate_unconditional_samples.py | tee samples.txt

The result will be saved to the file samples.txt. You can download it with the following commands (again, create a new cell and run it after the text has been generated):

from google.colab import files
files.download('samples.txt')

You can change the text generation parameters (the top-k cutoff, the degree of randomness, and so on; for a description, see the original work):

!python3 src/generate_unconditional_samples.py --top_k 40 --temperature 0.7 | tee samples.txt
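To give a feel for what the --top_k and --temperature flags control, here is a minimal pure-Python sketch of top-k sampling with temperature. It is not the repository's code: the function name sample_top_k and the toy scores are hypothetical, and treating top_k=0 as "no restriction" is an assumption of this sketch.

```python
import math
import random

def sample_top_k(logits, top_k=40, temperature=0.7, rng=random):
    """Pick an index from raw scores using top-k filtering and temperature."""
    # Rank candidate indices from the highest score to the lowest.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    # Keep only the top_k best candidates (0 means keep all, by assumption).
    kept = ranked[:top_k] if top_k > 0 else ranked
    # Temperatures below 1 sharpen the distribution, above 1 flatten it.
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights; subtracting the maximum keeps exp() numerically stable.
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-word vocabulary
rng = random.Random(0)          # seeded generator for repeatable runs
picks = [sample_top_k(logits, top_k=2, temperature=0.7, rng=rng) for _ in range(1000)]
print(sorted(set(picks)))
```

With top_k=2, only the two highest-scoring words can ever be chosen, however many times you sample; lowering the temperature further shifts more of the picks to the single best word.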

Since this is a greatly reduced model, do not expect miracles from it. Most of the generated samples will be nonsense, but meaningful passages do come up. The text must be in English; GPT-2 cannot yet work with other languages.

Examples of generated text

Samples of text generated by the full model: https://blog.openai.com/better-language-models/#sample1 (the slider at the top switches between 8 stories).

There is also a huge 2.4 MB text file with randomly generated samples: https://raw.githubusercontent.com/openai/gpt-2/master/gpt2-samples.txt

And another, 2.27 MB, with different randomness settings: https://raw.githubusercontent.com/openai/gpt-2/master/gpt2-topk40-samples.txt