Game of Thrones. Search for dialogue authors in books

Original author: Erik Germani
  • Transfer
  • Tutorial


Hi Habrahabr,

Based on the vote in the Graph Theory article in the Game of Thrones , I am translating the training material by Erik Germani, who received the social link graph from the first 5 books of the Song of Ice and Fire series, which formed the basis of the above article. The article does not contain a detailed description of machine learning methods, but rather tells how, in practice, existing tools can be used to search for dialogue authors in the text. Caution, a lot of letters! Go.

This tutorial is aimed at beginners in machine learning, such as myself, when I started this project a year ago. (And whom I still am, although now I am just green, and not bright green in this thread.) We will build a model that can determine who says the dialogue line in the books of George R.R. Martin's “Song of Ice and Fire.” For this, we will use the method of conditional random fields CRF ( approx. From Conditional Random Fields ) and the wonderful utility CRFsuite from Naoaki Okazaki. For text processing, we will use Python 2.7 and NLTK (Natural Language Toolkit).

I will try to explain everything as detailed as possible. I hope that when describing each step of my actions, you will be able to extract new tools and methods for yourself that will be useful in your own projects. The code will be explained from a beginner to a beginner, one who understands the Python syntax and knows about list abstraction, but nothing more. If you feel that my code clarifications are draining your soul, skip them.

Important: If you wanted to find here a theory about the method of conditional random fields, then this material is not for you. For me, CRFsuite is a beautiful black box that I touch with my monkey legs. We will spend some time increasing the efficiency of the model, but this will prove to be an erroneous attempt. If this upsets you, keep in mind:
  • I managed to achieve a good result (~ 75% accuracy) with CRFsuite out of the box
  • There will not be LaTeX

Our game plan is simple. As with any other machine learning algorithm, we need to prepare the data for training and validation. Then we select the properties that the algorithm will use for classification. After we process the text using these properties, we feed the result to CRFsuite and congratulate ourselves on a job well done. (or burden ourselves with the painstaking work of checking the guesses of the machine).

Let's start.

Download text


First of all, we need to find a copy of the source of the text and I will leave the choice for you whether you will pay an iron price for it or not.

If you are new to natural language processing, you might underestimate how difficult the source code can be. Each .txt file has an encoding that determines how each character will be described. ASCII, the format in which I read the walkthroughs of Ocarina of Time, was superseded by UTF-8, which can handle all special characters. (ASCII can represent 128 characters.) My copy of PLIP (approx. A Song of Ice and Fire) in UTF-8, which will bring a little inconvenience, but in fact is a bonus.

We will upload this text to NLTK to make it easier to manipulate. NLTK can do a ton of tasks, and that’s how I learned Python, if that turns out to be interesting to you, take a look at their excellent online book . For our purposes, use this tool to split text into tokens. This implies the division of sentences into words and punctuation marks, which is often done in natural language processing projects.
import nltk
nltk.word_tokenize("NLTK is ready to go.")
['NLTK', 'is', 'ready', 'to', 'go', '.']

NLTK has preloaded shells, but we need to upload our own.

Create a folder and paste the PLIP text files there. Since the books are very large, the public source text will be almost 10 MB. Not ideal for searching and replacing in text. I divided the text into books, but real professionals who are going to analyze more will rather divide each book into chapters and number them sequentially.
But let's not complicate everything now! Once the text is in the folder, we can run the following:
corpus = nltk.corpus.PlaintextCorpusReader(r'corpus', 'George.*\.txt', encoding = 'utf-8')

Here r indicates not to process the string. It doesn’t matter here, because I directly access the “corpus” folder, but if in your case the folder has a difficult location, it is better not to forget about it.

The second argument is a regular expression that tells NLTK to take all the files in the folder that contain “George” with the extension “.txt” in their names.

The encoding parameter is very important - if the encoding of the text does not match the specified, then errors will sprinkle.

The case in NLTK is very useful, with it you can get information from the text at different levels.
corpus.words("George R. R. Martin - 01 - A Game Of Thrones.txt")[-5:]
[u'the ', u'music', u'of ', u'dragons', u '.']
corpus.words()[0]
u'PROLOGUE '
corpus.sents()[1][:6]
[u '\ u201c', u'We ', u'should', u'start ', u'back', u ', \ u201d']

Here we hear the doomed Gared from the prologue of the Game of Thrones and see some Unicode characters represented in Python. You see that all Unicode strings begin with u , and contain special characters. \ u201c is the left quote, \ u201d is the right. I mentioned that UTF-8 is more of a bonus, and here's why. Let's see what happens if we open the same file without specifying the encoding.
bad_corpus = nltk.corpus.PlaintextCorpusReader(r'corpus', '.*\.txt')
bad_corpus.sents()[1][:9]
['\ xe2', '\ x80 \ x9c', 'We', 'should', 'start', 'back', ',', '\ xe2', '\ x80 \ x9d']

Just as \ u points to a Unicode string, \ x points to a hexadecimal string, so NLTK gives us 3 hexadecimal bytes - \ xe2, \ x80, \ x9c - and tries to split them. You can see that he does not know how to do this.

We will work with paragraphs, so let's take a look at one of them:
print corpus.paras()[1]
[[u '\ u201c', u'We ', u'should', u'start ', u'back', u ', \ u201d', u'Gared ', u'urged', u'as', u'the ', u'woods', u'began', u'to ', u'grow', u'dark ', u'around', u'them ', u'. '], [u' \ u201c ', u'The', u'wildlings', u'are ', u'dead', u '. \ u201d']]

You may notice how NLTK structures data. Offers are a list of tokens, and a paragraph is a list of offers. Easy enough!

Tags


Next, we need to prepare the data for training, but to do this, we need to decide on the labels that we will use. When parsing the text, the algorithm knows that any token belongs to the lexical category, each of which has its own label. JJ is an adjective, NN is a noun, IN is a preposition. These labels play a key role in the reliability of our model. The Penn Treebank ( approx. Project for text label ) highlights 36 such labels.

What will be our tags? The simplest option is character names. This will not work for several reasons:

  1. PLIP contains more than a thousand characters. This is too much choice for our poor model. We need to weed out as many tags as possible in order to correctly classify relying on banal luck.
  2. Characters are treated differently. Joffrey can be either “Joffrey,” or “Joff,” “Prince,” or even just “him.”
  3. If we use character names as labels, they must be defined in the training data. Otherwise, our model will not be aware of their existence and therefore will not be able to determine them.
  4. All characters simply sound the same. (I realized this thanks to another experience with machine learning, where I tried to separate the characters according to their vocabulary). Some have catchphrases such as “regrettable” ( approx. Grievous ) for Varis and “Hodor” for Hodor, but this is rare. In addition, for many there is not enough time for talking to distinguish them from the rest.


Although the definition by the names of characters sounds very tempting, let's drop this idea and think about the process that occurs in the reader’s head when solving a similar problem.

Take the book closest to you, open a random page, and try to determine who is talking there. How do you do this? You look at the closest proper names next to the dialog.

"Will saw them," Gared answered.

[...]

Sir Weimar Royce looked at the sky without any interest. “The night every day comes around the same time. Does darkness take your courage, Gared? ”

Although not every line of the dialogue is marked. Look further and see:

“Have you noticed the position of the bodies?”

You will look at the paragraphs above and below. Here are 2 on top:

“And the weapon?”

“A few swords and bows. One had an ax, heavy one, with two blades ... cruel iron. He was lying on the ground next to this man, right at his hand. ”

Not a drop of a hint. Two paragraphs below:

Will shrugged. “One was sitting near the cliff. The rest were on the ground, or something. ”

“Or asleep,” Royce suggested.

We know that Will would not ask himself, so we can say that he is not the author of this speech, and since many dialogs stretch into several paragraphs, we assume that the author of the first lines is Royce.

This scheme will help mark our model. We will teach her to determine her own names next to the text and if there are none, look in the nearby paragraphs. Then, our marks will be:

PS ± 2, FN ± 2, NN ± 2, Others.

PS - after the speaker. If the label of the paragraph is PS -2, this will mean that the name of the speaking part of the dialogue is located two paragraphs above. If FN 1, then the first name in the next paragraph. NN 0 means at least 2 names precede the dialogue and we need the one closest to the dialogue.

I will also determine ADR ± 2, for the characters that are referred to in the text of the dialogue.

Mark


Now we will prepare the training data. It will help us in this SublimeText. I opened the text “Game of Thrones”, highlighted the left quotation mark, chose Find -> Quick Find All, and pressed the Home key twice. Now the cursor is near the beginning of each paragraph with a dialogue. Then I typed "{}". Because there are no curly braces in the text, then we can use them to leave notes that we will use in the future.

We will use the regular expression (? <= \ {) (? = \}) To jump over curly braces. If you have not met with this design, then they are called positive retrospective and leading checks. The first expression in parentheses will cause SublimeText to start highlighting lines that have an opening curly bracket (escaped with a backslash) at the beginning. The following expression will say stop when there is a closing brace. As you can see, both expressions consist of a construction? =, Only the first contains also <.

Now you can go between brackets by pressing F3, which is a hotkey for finding the next one in SublimeText under Windows. This kind of optimization is important because You will tag approximately a thousand dialogs. At least I did so much. It was not as hard and time consuming as I expected. (Although maybe I'm lying, because I finished only a year later).

Before you get started, I want to make one point: think about whether you want to use positional labels (PS, FN, NN) or all the same character names. I know that I already said that we will not use names, but if you decide to use positional labels, then you associate this training data with the corresponding model. If you mark John’s dialogs with the label “Jon”, then in the future you will have the opportunity to change the label to a positional one, or use other labels if you find it better.

I think there is no single answer. Last year I tagged with character names. Now I need to make preliminary manipulations that add ambiguity. If the name of Eddard appears 2 paragraphs above and one paragraph below, then which one to choose? This will directly affect the behavior of the model and doing it automatically makes the process even more inaccurate. Therefore, I am not sure what to advise. It seems to me that from the point of view of the manual tag, it is easier to write the name of the character, but, from the point of view of automation, it is much more convenient to have positional tags.

Retrieving Properties


Well, you have tagged part of the text. I applaud you for your commitment to natural language processing. All we need to do now is write a few functions that will take the paragraph as an argument and mark them with properties that are of interest to us.

Remind me what properties? The workhorses responsible for the accuracy of the model are the following functions: whether PS, FN or NN exist in the current paragraph or in the neighboring paragraphs.

Name Search


Our first function is to find proper names. This can be done by defining parts of speech.

sentence = corpus.paras()[33][0]
print " ".join(sentence)
print nltk.pos_tag(sentence)
"Such eloquence, Gared," Ser Waymar observed.
[(u '\ u201c', 'NN'), (u'Such ',' JJ '), (u'eloquence', 'NN'), (u ',', ','), (u'Gared ',' NNP '), (u', \ u201d ',' NNP '), (u'Ser', 'NNP'), (u'Waymar ',' NNP '), (u'observed', 'VBD '), (u'. ','. ')]

NPP near Ser and Waymar means these are proper names. But there are also disadvantages:
  1. Errors happen. Notice how the closing quotation mark became a proper name?
  2. Identifying parts of speech takes time.

%timeit nltk.pos_tag(sentence)
100 loops, best of 3: 8.93 ms per loop

asoiaf_sentence_count = 143669
( asoiaf_sentence_count * 19.2 ) / 1000 / 60
45.974079999999994

In PLIP, there are many paragraphs for processing and more than 45 minutes to determine parts of speech will delay the testing and refactoring process. Of course, one could analyze everything once and continue to work with what happened. But for this, one would have to deal with yet another data structure and such a definition would have to be redone every time the source text changes. (And this is inevitable.)

Fortunately, it is not necessary to contact parts of speech to determine character names. This is one of the advantages of choosing PLIP for analysis: there are tons of data that has already been received. Let's scrape some of them.

Existing information


It turned out to be very useful Wiki Songs of Ice and Fire, I got an almost exhaustive list of character names by literally copying a page with a list of heroes . The result can be found here . If this is enough for you, then we will meet in the next chapter of the article . For those who are interested in how to automatically extract data from the page, I will give a couple of ways that I used in other projects.

Wget


An excellent utility that is very simple if you need to follow pre-known links. You don’t have to think about how to bypass links, you just need to create a file with a list and transfer it using the -i flag , like this:
wget -i list_of_links.txt

Requirements


Python has a requests library that is well suited for working with individual pages.
import requests
r = requests.get("http://awoiaf.westeros.org/index.php/List_of_characters")
html = r.text
print html[:100]

Parsing


After downloading html, we need to peel the page from unnecessary tags in order to get to the links. BeautifulSoup is an HTML parser that allows you to get links without too much fuss. After installation and parsing, you can find all the links simply by running:
parsed_html.find_all("a")

Here you can read more about it .

I want to talk about another way in which the lxml library is used . Using this library you can work with Xpath. I am new to Xpath, but this is a powerful way to navigate the tree structure.
import lxml.html
tree = lxml.html.fromstring(html)
character_names = tree.xpath("//ul/li/a[1]/@title")
print character_names[:5]
['Abelar Hightower', 'Addam', 'Addam Frey', 'Addam Marbrand', 'Addam Osgrey']

If you look askance at the Xpath expression from above, here is what it does:
tree.xpath("//ul        # выбирает все не нумерованные списки
             /li        # выделяет элементы списков
             /a[1]      # выделяет первую ссылку в элементе.
             /@title    # возвращает атрибут title
           ")

Now, you need to select the names among the result and delete that which has nothing to do with the name. Just going over the PLIP page, I noticed elements of the “Taena of Myr” look. We do not want our model to match the “of” particle to dialogs.

NLTK will help with this. It has a body of text with "bad" words - stopwords. Those that are so common that they make no sense to characterize the text.
particles = ' '.join(character_names).split(" ")
print len(set(particles))
stopwords = nltk.corpus.stopwords.words('english')
print stopwords[:5]
particles = set(particles) - set(stopwords)
print len(particles)
# Кое что всё же проскользнёт. Т.к. Aegon I в списке ПЛИП, то римская 
# цифра I будет восприниматься как имя. Нужно почистить это вручную.
"I" in particles
2167
['i', 'me', 'my', 'myself', 'we']
2146
True

And in the end, you need to add some more, possibly missed nicknames, such as Denis, Black Fish or Joff. If you are happy with the list of names, then save it in a file for future use.

Search for names. Part 2


We abandoned the idea of ​​finding names using parts of speech and got a list of names. We will extract the sequences of tokens and see if we can find them in our list of names. Finally it's time to write the code.
import itertools
from operator import itemgetter
particles = [particle.rstrip('\n') for particle in open('asoiaf_name_particles.txt')]
tokens = [u'\u201c', u'Such', u'eloquence', u',', u'Gared', u',\u201d', u'Ser', u'Waymar', u'observed', u'.']
def roll_call(tokens, particles):
    speakers = {}
    particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
    for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):
        index_run = map(itemgetter(1), g)
        speaker_name = ' '.join(tokens[i] for i in index_run)
        speakers[min(index_run)] = speaker_name
    return speakers

This function uses a lambda expression that I could not use last year when I did this project. The script that I used then is so terrible and not readable that I did not dare to publish it. In addition, I think that in this script, beginners can learn something new, so a little more about this.

Itertools is a noteworthy tool. I often use it to get rid of nesting or for permutations. In it we need the groupby function . Due to the release of a new version of this function at the time of writing, I completely preferred groupby rather than dropwhile and takewhile, which I used in a recursive manner.

When programming, I thought the roll_call functionmust know the position of the names that he found. So I decided to keep all the serial numbers of the names. This can be seen in the 3rd line of the function code.

particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]

Enumerate helped me a lot with my Python experience. It takes a list and for each element returns a bunch of serial number and the element itself.

4th line is the trickiest part of the code in all the material and I didn’t write it. It is taken directly from the library documentation .
for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):

Groupby goes through the list and groups the elements depending on the result of the lambda function. Lambdas are anonymous functions. Unlike roll_call, they do not need to be defined in advance. This is just the part of the code that takes arguments and returns a value. In our case, it simply subtracts a number from the serial number.

Let's take a look at how it works.
print tokens
particle_indices = [i for (i, w) in enumerate(tokens) if w in particles]
print particle_indices
for index, location in enumerate(particle_indices):
    lambda_function = index-location
    print "{} - {} = {}".format(index, location, lambda_function)
[u'\u201c', u'Such', u'eloquence', u',', u'Gared', u',\u201d', u'Ser', u'Waymar', u'observed', u'.']
[4, 6, 7]
0 - 4 = -4
1 - 6 = -5
2 - 7 = -5

This is the trick with groupby : the indexes are numbered sequentially, so if the items in the list also go one after another, then the lambda result will be the same for them.

groupby sees -4 and assigns a value of 4 for the group. 6th and 7th elements both have -5 and are respectively grouped.

Now we know where the compound names are and should use them. What does groupby return ? The key, the result of our lambda, and the group itself, the grouper object . Next, we use the map function to apply itemgetter (1) , which extracts an element from the bundle, to all elements of the group and thus we create a list of names in the original token list.

Aftergroupby we just need to extract the names found and store them in an associative array of speakers .
roll_call(tokens, particles)
{4: u'Gared ', 6: u'Ser Waymar'}

Optimization


Let's compare the speed of this function with the method in which we used parts of speech.

%timeit roll_call(tokens, particles)
100 loops, best of 3: 3.85 ms per loop

Not bad, 5-6 times faster. But we can improve the result by using set . The set sets almost instantly check if an item is in the list.
set_of_particles = set(particle.rstrip('\n') for particle in open('asoiaf_name_particles.txt'))
%timeit roll_call(tokens, set_of_particles)
10000 loops, best of 3: 22.6 µs per loop

You understand that you are good when you see Greek letters in speed.

Searching for Conversation Names


Now we need to write a program that will call the above function in the right places, so as to find the names of the characters before, in and after the text of the dialogs. We will put all this into a class that can collect a complete list of character names for us, which we will pass on to another algorithm for retrieving properties and then to CRFsuite.

But first, I would like to tidy up our data.

XML parser


After a successful one-line command with Xpath, I decided to write an XML parser for our text files. There is a ton of meaning in choosing this format. PLIP is a lot of books in which there are chapters, which in turn consist of paragraphs, and some of them contain dialogs - and we need to discreetly mark them. If I hadn’t translated the text into XML (and at first I didn’t), the labels would have littered the text itself.

I prefer to keep silent about the script below: it reminds me of my first steps in Python, huge functions, crutches and variables with long names.
from lxml import etree
import codecs
import re
def ASOIAFtoXML(input):
  # Каждый элемент input должен быть ассоциативным массивом названий глав с расположением его на диске.
  root = etree.Element("root")
  for item in input:
    title = item["title"]
    current_book = etree.Element("book", title=item["title"])
    root.append(current_book)
    with codecs.open(item["contents"], "r", encoding="utf-8") as book_file:
        #Ловушка для глав, названия которых не распознаются регулярным выражением.
        current_chapter = etree.Element("chapter", title="Debug")
        for paragraph in book_file:
            paragraph = paragraph.strip()
            if paragraph != "":
                title_match = re.match("\A[A-Z\W ]+\Z", paragraph)
                if title_match:
                    current_chapter = etree.Element("chapter", title=title_match.group())
                    current_book.append(current_chapter)
                else:
                    current_graf = etree.SubElement(current_chapter, "paragraph")
                    while paragraph != "":
                        current_dialogue = current_graf.xpath('./dialogue[last()]')
                        speaker_match = re.search("(\{(.*?)\} )", paragraph)
                        if speaker_match:
                            speaker_tag = speaker_match.group(1)
                            speaker_name = speaker_match.group(2)
                            paragraph = paragraph.replace(speaker_tag, "")
                        open_quote = paragraph.find(u"\u201c")
                        if open_quote == -1:
                            if current_dialogue:
                                current_dialogue[0].tail = paragraph
                            else:
                                current_graf.text = paragraph
                            paragraph = ""
                        elif open_quote == 0:
                            current_dialogue = etree.SubElement(current_graf, "dialogue")
                            if speaker_name:
                                current_dialogue.attrib["speaker"] = speaker_name
                            close_quote = paragraph.find(u"\u201d") + 1
                            if close_quote == 0:
                                # функция find возвращает -1 в данном случае, поэтому сравнивая с 0
                                # мы определяем нет ли там больше закрывающей кавычки. Это происходит
                                # в длинных монологах разбитых по параграфам.
                                close_quote = len(paragraph)
                            current_dialogue.text = paragraph[open_quote: close_quote]
                            paragraph = paragraph[close_quote:]
                        else:
                            if current_dialogue:
                                current_dialogue[0].tail = paragraph[:open_quote]
                            else:
                                current_graf.text = paragraph[:open_quote]
                            paragraph = paragraph[open_quote:]
    return root
tree = ASOIAFtoXML([{"title": "AGOT", "contents": "corpus/train_asoiaf_tagged.txt"}])
# Так мы сохраняем дерево в файл.
# et = etree.ElementTree(tree)
# et.write(codecs.open("asoiaf.xml", "w", encoding="utf-8"), pretty_print=True)

The essence of the code above: we use lxml to create a tree, then we go line by line through the text. If the line is recognized as the name of the chapter (uppercase letters, punctuation and spaces), we add a new chapter to the top of the current book. Once we are in the text of the chapter, we wade through the paragraphs, using another regular expression to determine who spoke the dialogue and add it to the corresponding vertex of the dialogue. Previously, they should already be labeled, of course.

Interesting note on XML. This is a hierarchical structure, therefore, by its nature, it requires strict branching, the top is at the top. But this is not so in prose. In prose, dialogues are inside the text. lxml provides a solution: text and tail. Thus, the XML vertex stores the text, but this text is interrupted after the next vertex is added.
markup = '''Worse and worse, Catelyn thought in despair. My brother is a fool.
Unbidden, unwanted, tears filled her eyes. 
“If this was an escape,” she said softly,
“and not an exchange of hostages, why should the Lannisters
give my daughters to Brienne?”'''
graf = lxml.etree.fromstring(markup)
print graf.text
Worse and worse, Catelyn thought in despair. My brother is a fool.
Unbidden, unwanted, tears filled her eyes.

print graf[0].text
"If this was an escape,"

What will happen to the remaining “she said softly”? We will save in it in the variable vertex tail .
print graf[0].tail
she said softly,

And so on, adding the rest of the text to each vertex of the dialog.

As a result, this greatly simplifies our search for dialogue authors when we need them. And we need them right now!
class feature_extractor_simple:
    """Analyze dialogue features of a paragraph. Paragraph should be an lxml node."""
    def __init__(self, paragraph_node, particles, tag_distance=0):
        self.paragraph = paragraph_node
        self.particles = set(particles)
        self.tag_distance = tag_distance
        self.raw = ''.join(t for t in self.paragraph.itertext())
        self.tokens = self.tokenize(self.raw)
    def tokenize(self, string):
        return nltk.wordpunct_tokenize(string)
    def find_speakers(self, tokens):
        speakers = {}
        particle_indices = [i for (i, w) in enumerate(tokens) if w in self.particles]
        for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):
            index_run = map(itemgetter(1), g)
            speaker_name = ' '.join(tokens[i] for i in index_run)
            speakers[min(index_run)] = speaker_name
        return speakers
    def pre_speak(self, prior_tag="FN", near_tag="NN"):
        # Имена до диалога.
        features = {}
        if self.paragraph.text is not None:
            speakers = self.find_speakers(self.tokenize(self.paragraph.text))
            if len(speakers) > 0:
                features.update({"{} {}".format(prior_tag,self.tag_distance): speakers.values()[0]})
            if len(speakers) > 1:
                features.update({"{} {}".format(near_tag,self.tag_distance): speakers[max(speakers.keys())]})
        return features
    def dur_speak(self, tag="ADR"):
        # Имена адресатов.
        features = {}
        for dialogue in self.paragraph.itertext("dialogue", with_tail=False):
            tokens = self.tokenize(dialogue)
            named = self.find_speakers(tokens)
            addressed = {k: v for (k, v) in named.items() if tokens[k-1] == "," or tokens[k + 1 + v.count(" ")].startswith(",")}
            if len(addressed) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): addressed[max(addressed.keys())]})
        return features
    def post_speak(self, tag="PS"):
        features = {}
        # Имена после диалогов.
        tails = [line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None]
        for tail in tails:
            tokens = self.tokenize(tail)
            speakers = {k: v for (k, v) in self.find_speakers(tokens).items() if k <= 1}
            if len(speakers) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): speakers[min(speakers.keys())]})
                break
        return features

A few words about these functions.

If you are new to Python, then don't be afraid of classes. You just need to write regular functions, passing them self as an argument . This will let Python know which object the function is currently working with. A class is like a clone factory, and an object is a clone. All clones have the same DNA, these are methods and variables, but because of their life experience, their personalities differ, which in this context is the data transmitted to them.

Classes also have a special function __init__ , which allows you to initialize object variables.
Now you can relax, as your data is in the hands of a specialized class. And since you abstracted his behavior, then at the click of a finger you can get the information processed by him.
paragraph = tree.xpath(".//paragraph")[32]
example_extractor = feature_extractor_simple(paragraph, particles)
print example_extractor.raw
print example_extractor.pre_speak()
print example_extractor.dur_speak()
print example_extractor.post_speak()
"Such eloquence, Gared," Ser Waymar observed. "I never suspected you had it in you."
{}
{'ADR 0': u'Gared '}
{'PS 0': 'Ser Waymar'}

If you are confused by the work of some functions, I will briefly explain what they do. If everything from above looks acceptable to you, then you know what to do, see you in the next chapter.

There is a clumsy manipulation of the associative array, and this is because they are not ordered in Python. It reminds me of a feeling when leaving home you feel that there are no keys in your pocket, locking the door. I had to constantly check whether we get the first or last character, depending on the case, I look at the value of the keys and select the minimum / maximum.

pre_speak


As I said above, the text attribute contains all the text up to the first line of the dialog. We just need to find the names of the characters in it.

dur_speak


In the case when the name is in the body of the dialogue, which can consist of many lines, we need to go over them all:
for dialogue in self.paragraph.itertext("dialogue", with_tail=False)

The itertext function in lxml allows you to get all the vertex text. We will also set the flag with_tail = False to search only vertices without a “tail”, which means only the text of the dialog.

As soon as we find the names of the characters, we need to select in them only those that are separated by a comma, which will allow us to find an appeal. (for example, “Ned, promise me.” / “Promise me, Ned.”)

I feel inside that the last name found in the dialogue is likely to answer in the next paragraph, so we will rewrite the addressee with the last name mentioned.

post_speak


For this function, we need only the first character after the dialogue. Therefore, we interrupt the cycle as soon as we find one.

The function looks in the first 2 tokens after the closing quote. So you will find dialogs like:
“Goodbye,” said John.

Tip for novice programmers: you can call the fetch function when building a list.
tails = [line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None]

This allowed to get dialogs in one line. (you just need to specify a condition to remove all results without a tail)

CRFsuite


Perhaps this is the most curious part for you. It contains conditionally random fields, whatever they mean, and is launched from the command line, but there is no way to see how it works from the inside.

But in fact, CRFsuite is a very simple and interesting part of it all. While writing the material, I found that it has a library for Python, but now we will not complicate everything and will use the executable file using the command line.

(I plan to update the model when the next book, “Winds of Winter”, sees the light of day. But I have a couple more years until this happens)

All CRFsuite needs is a text with some tab-delimited properties, such as these:

FN 0 Graf Sent Len = 4 FN 1 = True FN -2 = True FN 0 = True NN 1 = True

This is a format for training data. The first attribute is the correct answer. All subsequent properties. They may look like you want, but do not use a colon - this is for weighted properties, and therefore can lead to a false interpretation.

You need to open the command line wherever crfsuite.exe is located and type the following there:
crfsuite learn -m asoiaf.model train.txt

This will create a model, which is the brain of everything. You can call her anything you like, I called my asoiaf. To look at the accuracy of the model, type this:
crfsuite tag -qt -m asoiaf.model test.txt

To actually run the model for tagging, type
crfsuite tag -m asoiaf.model untagged.txt

untagged.txt should look like train.txt , but without the attribute of the correct answer at the beginning, i.e. something like this:

NN -1 = True FN 0 = True FN 2 = True FN -1 = True NN 0 = True

Here you can learn more about it.

Let's play around with a lot of properties that can improve the accuracy of the model. We will start with the simplest: with boolean values ​​that determine the location of the position labels in and near the paragraph.

And again, our class for extracting properties, only now with a few new features at the beginning.
class feature_extractor:
    """Analyze dialogue features of a paragraph. Paragraph should be an lxml node."""
    def __init__(self, paragraph_node, particles, tag_distance=0):
        self.paragraph = paragraph_node
        self.particles = set(particles)
        self.tag_distance = tag_distance
        self.raw = ''.join(t for t in self.paragraph.itertext())
        self.tokens = self.tokenize(self.raw)
        self.speaker = self.xpath_find_speaker()
    def features(self):
        features = {}
        features.update(self.pre_speak())
        features.update(self.dur_speak())
        features.update(self.post_speak())
        return features
    def local_features(self):
        #Разнообразные свойства живут в этой функции как в комуналке
        features = []
        if self.tokens.count(u"\u201c") == 0:
            features.append("NoQuotes=True")
        prior = self.paragraph.getprevious()
        try:
            last_dialogue = list(prior.itertext("dialogue", with_tail=False))[-1].lower()
            hits = [w for w in ['who', 'you', 'name', '?'] if w in last_dialogue]
            if len(hits) > 2:
                features.append("Who Are You?=True:10.0")
        except (AttributeError, IndexError):
            pass
        try:
            dialogue = list(self.paragraph.itertext("dialogue", with_tail=False))[0].lower()
            for token in ['name', 'i am', u'i\u2019m']:
                if token in dialogue:
                    features.append("My Name=True:10.0")
                    break
        except (AttributeError, IndexError):
            pass
        if self.tokens[0] in self.particles:
            features.append("FirstSpeakerIndex0=True")
        if self.paragraph.text is not None:
            name_precount = len(self.find_speakers(self.tokenize(self.paragraph.text)))
            if name_precount > 2:
                features.append("Many Names Before=True")
            conjunctions = set([w.lower() for w in self.tokenize(self.paragraph.text)]).intersection(set(['and', 'but', 'while', 'then']))
            if len(conjunctions) > 0 and self.paragraph.find("dialogue") is not None:
                features.append("Conjunction in Head=True")
        short_threshold = 10
        if len(self.tokens) <= short_threshold:
            features.append("Short Graf=True")
        dialogue_length = sum(map(len, self.paragraph.xpath(".//dialogue/text()")))
        dialogue_ratio = dialogue_length / len(self.raw)
        if dialogue_ratio == 1:
            features.append("All Talk=True")
        elif dialogue_ratio >= 0.7:
            features.append("Mostly Talk=True")
        elif dialogue_ratio < 0.3 and not self.tokens < short_threshold:
            features.append("Little Talk=True")
        return features
    def feature_booleans(self):
        bool_features = []
        for tag in ["PS", "FN", "NN", "ADR", ]:
            label = "{} {}".format(tag, self.tag_distance)
            if label in self.features().keys():
                bool_features.append("{}=True".format(label))
            else:
                bool_features.append("{}=False".format(label))
        return bool_features
    def tokenize(self, string):
        return nltk.wordpunct_tokenize(string)
    def find_speakers(self, tokens):
        speakers = {}
        particle_indices = [i for (i, w) in enumerate(tokens) if w in self.particles]
        for k, g in itertools.groupby(enumerate(particle_indices), lambda (i,x): i-x):
            index_run = map(itemgetter(1), g)
            speaker_name = ' '.join(tokens[i] for i in index_run)
            speakers[min(index_run)] = speaker_name
        return speakers
    def xpath_find_speaker(self):
        speakers = self.paragraph.xpath(".//@speaker")
        if speakers == []:
            return "NULL"
        else:
            return speakers[0]
    def pre_speak(self, prior_tag="FN", near_tag="NN"):
        # Имена перед диалогом
        features = {}
        if self.paragraph.text is not None:
            speakers = self.find_speakers(self.tokenize(self.paragraph.text))
            if len(speakers) > 0:
                features.update({"{} {}".format(prior_tag,self.tag_distance): speakers.values()[0]})
            if len(speakers) > 1:
                features.update({"{} {}".format(near_tag,self.tag_distance): speakers[max(speakers.keys())]})
        return features
    def dur_speak(self, tag="ADR"):
        # Адресаты в диалоге
        features = {}
        for dialogue in self.paragraph.itertext("dialogue", with_tail=False):
            tokens = self.tokenize(dialogue)
            named = self.find_speakers(tokens)
            addressed = {k: v for (k, v) in named.items() if tokens[k-1] == "," or tokens[k + 1 + v.count(" ")].startswith(",")}
            if len(addressed) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): addressed[max(addressed.keys())]})
        return features
    def post_speak(self, tag="PS"):
        features = {}
        # Имена поле диалога
        tails = [line.tail for line in self.paragraph.iterfind("dialogue") if line.tail is not None]
        for tail in tails:
            tokens = self.tokenize(tail)
            speakers = {k: v for (k, v) in self.find_speakers(tokens).items() if k <= 1}
            if len(speakers) > 0:
                features.update({"{} {}".format(tag, self.tag_distance): speakers[min(speakers.keys())]})
                break
        return features
paragraph = tree.xpath(".//paragraph")[-1]
example_extractor = feature_extractor(paragraph, particles)
print example_extractor.raw
print example_extractor.features()
print example_extractor.local_features()
print example_extractor.feature_booleans()
And in their hands, the daggers.
{}
['NoQuotes = True', 'Short Graf = True', 'Little Talk = True']
['PS 0 = False', 'FN 0 = False', 'NN 0 = False', 'ADR 0 = False']

Last night, during an undocumented insanity with machine learning, I tried to improve many properties. Below are some drafts acceptable for publication.

Option 1: Only True Positional Boolean Values
Label Count Recall
PS 0  	207    	0.9949
FN 0   	185    	0.95
NULL   	118    	0.3492
OTHER  	56     	0.3939
PS - 2 	44     	0.5238
Item accuracy: 430 / 678 (0.6342)

Further we will meet a lot of such statistics, and therefore, let's immediately determine what they mean.

Imagine that we are at dinner, looking at people. I asked you to determine if a random passerby is an Illuminati. You, as a person who fully believes in conspiracy theories, finish eating dumplings and start tagging passers-by.

Accuracy ( approx. Precision ), a value that will not be considered here, shows the frequency of errors of the first kind. In other words, how often you mistakenly ranked a person among the Illuminati.

Completeness ( approx. Recall ) measures the number of labels in the verification data that the model has correctly determined.

And F1 is a combination of both labels. You can see that if you classify all people as Illuminati, this will ensure maximum completeness and negligible accuracy.

Because everything is marked by me, then I'm not very interested in the accuracy of the model. I need completeness and accuracy.

In the first version of the properties, I considered only true Boolean values. Those. in the paragraph above, all sets were of the form “ADR 0 = True” and “PS 0 = True”. Accuracy ( approx. Item accuracy ) was 63.4%.

63.4% is this good? Based on the fact that NULL, PS 0 and FN 0 make up three quarters of our test data, and they are easily found by nature, we can definitely do better. Now add the rest of the positional boolean values, false.

Option 2: all positional Boolean values
Label Count Recall
NULL 254 0.9048
PS 0 204 0.9899
FN 0 149 0.975
OTHER 24 0.2273
PS - 2 19 0.2857
Item accuracy: 515/678 (0.7596)

Now we perfectly define simple cases and get decent accuracy. 75% means that you only need to mark the first book, “Game of Thrones,” and one-third of “The Battle of the Kings,” and the model to determine the remaining three quarters of the books itself. It takes many hours of work, but within reason.

And yet, I see no reason why not define NULL tags with a completeness of 98% +, so let's add a property aimed at that.

Option 3: quotation marks?
Label Count Recall
PS 0 218 0.9907
NULL 180 0.9119
FN 0 167 0.9118
OTHER 63 0.3784
PS 2 25 0.5
Item accuracy: 550/710 (0.7746)

We count the number of opening quotes in the paragraph.

I want to say that I am surprised that NULL has not become more accurate. We need to work on this. Further I would like to improve FN 0.

Option 4: index of the first name?
Label Count Recall
PS 0 218 0.9907
NULL 183 0.9057
FN 0 157 0.8971
OTHER 68 0.4189
PS - 2 23 0.5484
Item accuracy: 551/710 (0.7761)

This property contains the index of the first name.

hmm ... maybe too complicated, let's get back to boolean values ​​again.

Option 5: index 0 name? + redundancy
Label Count Recall
PS 0 216 0.986
FN 0 166 0.9265
NULL 160 1
OTHER 85 0.5811
PS 2 32 0.7143
Item accuracy: 578/710 (0.8141)

Here it is! I did not correctly count the number of opening quotes, thereby spoiling the result.

As soon as I fixed it, NULL is determined perfectly ... but now we have run out of easy ways to improve the model. Now I really need to contrive to further improve the result! Let's see if this works ...

Option 6: After the speaker (PS) + and - 2
Here we will use a Boolean value if the speaker is two paragraphs above or below the current one. In theory, this should increase the result of PS -2.
Label Count Recall
PS 0 216 0.986
FN 0 166 0.9265
NULL 160 1
OTHER 84 0.5676
PS 2 32 0.7143
Item accuracy: 578/710 (0.8141)

No effect!

Option 7: sequences ??
Label Count Recall
PS 0 217 0.986
FN 0 168 0.9265
NULL 160 1
OTHER 82 0.5541
PS 2 30 0.6429
Item accuracy: 576/710 (0.8113) Instance accuracy: 56/142 (0.3944)

Wait! It turned out that CRF can work with sequences, and in fact this is its meaning. I ignored the instance accuracy value ( approx. Instance accuracy ), because it was always 0/1, which means that the model viewed the entire text as one long dialogue.

Sorry, I need to slap myself. Assuming that we increase accuracy - and this is an open question - how do we use this functionality? I tried to indicate the length of each sequence in 5 paragraphs, but this does not seem to me correct.

Perhaps if two consecutive NULLs meet, then this will be a sequence, assuming that the conversation is completed.

After playing around with it, I was not able to build a model that would work with conversations. As I understand it, it should have many special transition weights ( approx. Transition weights ), depending on the position in the sequence. Thus, the model will make different decisions, depending on our position, at the beginning, middle or at the end of the conversation.

But nothing in the behavior of the model shows that this is happening. In the near future, I will play a little with other properties. Oh yes, let's take a look at the script that generates our training and test data. It is not optimized because calculates properties for each paragraph 5 times. I will leave it as it is for this material, but keep in mind that it can be accelerated by using one cycle to preserve the Boolean properties of paragraphs and a second to add to existing ones.
tree = ASOIAFtoXML([{"title": "ASOIAF", "contents": "corpus/train_asoiaf_pos_tagged.txt"}])
paragraphs = tree.xpath(".//paragraph")
In [29]:
def prep_test_data(paragraphs):
    max_index = len(paragraphs)
    results = []
    for index, paragraph in enumerate(paragraphs):
        extractor = feature_extractor(paragraph, set_of_particles)
        all_features = extractor.local_features() + extractor.feature_booleans()
        for n in [-2, -1, 1, 2]:
            if 0 <= n+index < max_index:
                neighbor_features = feature_extractor(paragraphs[index + n], set_of_particles, tag_distance = n).feature_booleans()
                if neighbor_features:
                    all_features += neighbor_features      
        all_features.insert(0, extractor.speaker)
        results.append("\t".join(all_features))
    return results
results = prep_test_data(paragraphs)
In [31]:
max_index = len(results)
with codecs.open(r"new_test.txt", "w", "utf-8") as output:
    for line in results[:int(max_index/2)]:
            output.write(line + '\n')
with codecs.open(r"new_train.txt", "w", "utf-8") as output:
    for line in results[int(max_index/2):]:
            output.write(line + '\n')

More properties


I tried several other properties:
  • Counting the number of names before the first line of the dialog. In theory, this is the place with the most NN. There is no result.
  • A property that marks a paragraph in whole or in part is a dialogue. This helped to improve the situation with PS -2 and FN -2, but the difference was not significant.
  • Short / long paragraphs. Not much use.
  • “And” or “but” in the text before the dialogue. (in an attempt to focus on NN 0, where they were ignored)

I thought that the latter was a rather deft move, but it didn’t work and we didn’t get an accuracy higher than 81%.

I tried to change the training data with the verification and this gave 84%. You should not spend a lot of time improving many properties for certain data, as this leads to retraining. In fact, mixing training data and test data is a good idea. I did not mix them, because I thought that this would lead to damage to the sequences, but we don’t use them anymore, so why not? We will mix them.

A little mixed data

Received 82%.

Okay! I think here we have reached the limit of my skills.

Will there be no continuation?


Let's summarize and talk about what can be done next.
  • Улучшить обучающие данные. Сейчас у меня есть 700 параграфов для обучения и проверки. Всего их около 40000. Радует, что я пометил 1.7% и получил в результате 80%. (хотя я сомневаюсь в 80%, на практике это было больше похоже на 75%.) Что произойдёт если мы используем 10000 параграфов для обучения? Ценность дополнительных тренировочных данных не сильно возрастает с увеличением их количества, но для редких меток, как ADR, попросту недостаточно моих 700 параграфов.
  • Тщательно изучить документацию CRFsuite. Там наверняка найдутся параметры, которые мне помогут.
  • Поэкспериментировать со взвешенными свойствами.
  • Действительно попробовать воспользоваться последовательностями.
  • Написать оболочку для Python. Кроме как это позволит избежать переброса данных, это так же очень поможет улучшить модель. Например, я могу сделать…
  • Запасную метку. Слабым местом моей модели является то, что слишком много меток OTHER. При пометке OTHER, модель как бы говорит, смотри, я знаю что кто то говорит в этом параграфе, но ты не дал мне способа его определить. Поэтому никакой положительной стороны у метки OTHER нет — улучшением будет даже самое нелепое тыканье в темноту.
  • Метку пола. Мне нравятся диалоги с именами в конце. Этого не происходит в разговорах мужчин и женщин. Так, много диалогов Кейтилин Старк остались не помеченными потому, что, как правило, она единственная женщина в окружении и поэтому удобнее обращаться к ней просто как «она». Это будет достаточно легко обойти; мы можем искать все «она» в после диалоговой части и потом проверить заголовок главы и если это женское имя, то она нашлась.


Заключение


Good! I hope this was useful to someone. Thank you for reading and if you want to contact me, then I am on Twitter.

Also, I would like to note that all of the above was done for a large critical study of the Game of Thrones. If you are a fan of these books and would like to read the analysis, which was possible thanks to the dialogue label, then I will publish everything soon.

Also popular now: