A simple method for extracting relationships and facts from text

    Earlier, we wrote about analyzing restaurant reviews in order to extract mentions of various aspects (food, decor, and the like). Recently, a question came up in the comments about extracting factual information from text: is it possible, for example, to extract facts from car reviews, such as “the gearbox breaks down quickly” => breaks_down(gearbox, quickly), so that these facts can be worked with later. In this article, we describe one approach to solving such a problem.



    The method we will talk about is based on a number of simplifications. It is not the most accurate, but it is easy to implement and allows you to quickly put together a prototype of the application it is meant for. In some cases it will be quite sufficient, and in others improvements can be added without departing from the basic principle.

    Consider, for example, the following sentence from a review of a TV wall mount:

    All the holes match with the TV, the washers do not fall out.

    We want to extract relationships from it, for example, in the following form:

    predicate => match
    subject => holes
    object => TV

    predicate => do not fall out
    subject => washers

    This task is often called Semantic Role Labeling: there is a verb (matches, falls out, and so on) and it has arguments. What exactly the arguments of a verb are, and how many of them there should be, is a subject of debate among linguists. In practice, they are whatever a particular task needs. Therefore, in order not to plunge deeply into philosophical problems, we decide that we need the subject, the object, and the circumstance/condition under which the action takes place. An object or a subject can also carry a description: for example, in the phrase “good TV” the word “good” plays this role. If the description is not a characteristic of the object's quality but one of its components (“plasma TV”), we put it into a separate class. This will be enough to begin with, and we will return to this question a little later.
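    To make the target concrete, a single extracted relation can be pictured roughly as the following structure (our own illustrative sketch; the field names are not part of the method itself):

        from dataclasses import dataclass, field
        from typing import List, Optional

        @dataclass
        class Relation:
            predicate: str                      # the verb, e.g. "match"
            subject: Optional[str] = None       # who/what performs the action
            obj: Optional[str] = None           # what the action is applied to
            circumstance: Optional[str] = None  # condition/circumstance of the action
            descriptions: List[str] = field(default_factory=list)  # e.g. "good"
            components: List[str] = field(default_factory=list)    # e.g. "plasma"

        # "All the holes match with the TV, the washers do not fall out"
        relations = [
            Relation(predicate="match", subject="holes", obj="TV"),
            Relation(predicate="do not fall out", subject="washers"),
        ]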

    Now we will try to reduce the relation extraction problem to a sequence labeling problem, whose solution we discussed earlier.
    All
    holes -> subject
    with
    the TV -> object
    match -> predicate
    ,
    washers -> object
    not -> predicate
    fall out -> predicate

    We place the corresponding category next to each word and mark up a training sample of sufficient size in this way. Next, we can train any classifier that can work with sequences, for example a CRF, after which, feeding it a new sentence, we get a category prediction for each word. For our experiments we, of course, used our API, which anyone can access for free by registering on our site. We already described how to use it in detail here, so we will not repeat that, so as not to lose the main idea.
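    For readers who want to reproduce this step without our API, here is a minimal sketch of the same idea using the open-source sklearn-crfsuite library (an illustration on our part, not the code we actually ran; the features and the toy training example are made up):

        import sklearn_crfsuite

        def word_features(sent, i):
            # Very basic per-token features; a real system would add many more.
            word = sent[i]
            return {
                "word.lower": word.lower(),
                "word.isdigit": word.isdigit(),
                "prev.word": sent[i - 1].lower() if i > 0 else "<BOS>",
                "next.word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
            }

        # Toy training data: one tokenized sentence with a role label per token
        # ("O" means no role). The labels follow the scheme described above.
        train_sents = [
            (["All", "holes", "with", "the", "TV", "match", ",",
              "washers", "not", "fall", "out"],
             ["O", "subject", "O", "O", "object", "predicate", "O",
              "object", "predicate", "predicate", "predicate"]),
        ]

        X_train = [[word_features(toks, i) for i in range(len(toks))]
                   for toks, _ in train_sents]
        y_train = [labels for _, labels in train_sents]

        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit(X_train, y_train)

        # Predict roles for a new sentence, one label per word.
        new_sent = ["Changed", "awl", "for", "soap"]
        print(crf.predict([[word_features(new_sent, i)
                            for i in range(len(new_sent))]]))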

    We manually annotated about 100 sentences, which is actually a very small sample for such a task. Then we fed several new sentences to the model, and here is what came out:
    First sentence:
    Changed -> predicate
    awl -> object
    for
    soap -> object
    for
    such
    money -> object
    ,
    bought -> predicate
    on the
    name -> object

    Second sentence:
    In
    the set -> object
    besides
    the usual
    knife -> object
    there is -> predicate
    for
    dotted -> description
    notches -> object

    At this stage we noticed that the link between a verb and the arguments belonging to it has been lost (in fact, we knew this from the start, but kept quiet about it to simplify the presentation).

    There are various ways to solve this problem. In highly specialized systems, the type of an argument can itself tell which verb it belongs to:

    The train arrives at the station at 16-00, and leaves at 15-20.

    It is clear that here we can immediately label 16-00 as arrival_time and 15-20 as departure_time, while the verbs will also have types: arrives will be of the type “arrival” and leaves of the type “departure”. Thus, the question of correctly matching arguments to verbs is handed over to the sequence labeling system, and whether it copes with it or not will depend on the algorithm used.
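    For illustration, with such a specialized scheme the labeled output for the train example might look like this (hypothetical labels of our own; grouping the arguments back into relations then becomes trivial):

        # Hypothetical typed labels: each argument label already names the
        # relation it belongs to, so no separate attachment step is needed.
        labeled = [
            ("The", "O"), ("train", "O"), ("arrives", "arrival"),
            ("at", "O"), ("the", "O"), ("station", "O"),
            ("at", "O"), ("16-00", "arrival_time"), (",", "O"),
            ("and", "O"), ("leaves", "departure"), ("at", "O"),
            ("15-20", "departure_time"),
        ]

        # Group the arguments by the relation named in their label.
        relations = {}
        for word, tag in labeled:
            if tag == "O":
                continue
            rel = tag.split("_")[0]          # "arrival_time" -> "arrival"
            relations.setdefault(rel, {})[tag] = word
        print(relations)
        # {'arrival': {'arrival': 'arrives', 'arrival_time': '16-00'},
        #  'departure': {'departure': 'leaves', 'departure_time': '15-20'}}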

    This approach is well suited for parsing commands (“wake me up tomorrow at 10 am” => wake_up(me, tomorrow at 10 am), “order pizza for 10 people” => order(pizza, 10 people), and so on).

    In our case, we could have defined the argument types more precisely. Say, in the phrase “all the holes match with the TV”, we would have the relation type “match” and two arguments, “what_matches” and “what_it_matches_with”. This works fine when the number of relations is limited and strictly defined.

    We, however, chose a more general scheme from the start, with rather vague argument types, in the hope that they would fit any verb. The price for this is a second phase of analysis: determining which arguments correspond to which verb.

    Since we are building a simple fact extraction method, we assume that every argument belongs to the verb closest to it. This is not always the case, but it is often true. The same heuristic can be used to match objects with their descriptions.

    Accepting this simplification, we wrote a program that first finds all the detected verbs and then attaches the corresponding arguments to them, measuring the distance to the verb in words (a rough sketch of this step is shown after the results below). Using this program, we extracted the following relations from the sentences above:

    First sentence:

    predicate => changed
    object1 => awl
    object2 => soap

    predicate => bought
    object => name

    Second sentence:

    predicate => there is
    object1 => notch
    description => dotted
    object2 => knife
    description => ordinary
    object3 => set

    The result is quite interesting, considering that all of the work, including the manual annotation of the training sample, took us about 4 hours. To improve quality, one can, in a second stage of analysis, collect all the extracted facts together and try to discard incorrectly identified relations based on an analysis of the results.
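    The sketch below shows roughly how the attachment step mentioned above can look, assuming the sequence labeler returns (word, tag) pairs. It is an illustration of the nearest-verb heuristic, not the program we actually wrote; details such as tie-breaking and merging multi-word predicates are glossed over:

        def attach_arguments(tagged_sentence):
            # tagged_sentence: list of (word, tag) pairs from the sequence labeler.
            predicates = [i for i, (_, tag) in enumerate(tagged_sentence)
                          if tag == "predicate"]
            if not predicates:
                return []
            relations = {i: {"predicate": tagged_sentence[i][0], "args": []}
                         for i in predicates}
            for i, (word, tag) in enumerate(tagged_sentence):
                if tag in ("predicate", "O"):
                    continue
                # Attach the argument to the predicate closest to it, in words.
                nearest = min(predicates, key=lambda p: abs(p - i))
                relations[nearest]["args"].append((tag, word))
            return list(relations.values())

        tagged = [("Changed", "predicate"), ("awl", "object"), ("for", "O"),
                  ("soap", "object"), ("for", "O"), ("such", "O"),
                  ("money", "object"), (",", "O"), ("bought", "predicate"),
                  ("on", "O"), ("the", "O"), ("name", "object")]
        print(attach_arguments(tagged))
        # [{'predicate': 'Changed', 'args': [('object', 'awl'), ('object', 'soap')]},
        #  {'predicate': 'bought', 'args': [('object', 'money'), ('object', 'name')]}]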

    In general, as we can see, the task of extracting relations from texts can be solved in different ways. We have covered only a few, trying to focus on readily available methods, and, as you can see, building such an analyzer today is not that difficult.
