A simple method for extracting relationships and facts from text

    Earlier, we wrote about analyzing restaurant reviews in order to extract mentions of various aspects (food, decor, and the like). Recently, a question came up in the comments about extracting factual information from text: is it possible, for example, to extract facts from car reviews, such as “the gearbox breaks down quickly” => breaks_down(gearbox, quickly), so that these facts can be worked with later. In this article, we describe one approach to solving such a problem.



    The method we will talk about is based on a number of simplifications. It is not the most accurate, but it is easy to implement and allows you to quickly put together a prototype of the application it is meant for. In some cases it will be quite sufficient, and in others improvements can be added without departing from the basic principle.

    Consider, for example, the following sentence from a review of a TV wall mount:

    All the holes match with the TV, the washers do not fall out.

    We want to extract relationships from it, for example, in the following form:

    predicate => match
    subject => holes
    object => TV

    predicate => do not fall out
    subject => washers

    This task is often called Semantic Role Labeling: there is a verb (matches, falls out, and so on) and it has arguments. What exactly the arguments of a verb are, and how many of them there should be, is a subject of debate among linguists. In practice, they are whatever a particular task needs. Therefore, in order not to plunge deeply into philosophical problems, we decide that we need the subject, the object, and the circumstance/condition under which the action takes place. An object or a subject can also carry a description: for example, in the phrase “good TV” the word “good” plays this role. If the description is not a characteristic of the object's quality but one of its components (“plasma TV”), we put it into a separate class. This will be enough to begin with, and we will return to this question a little later.
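    To make the target concrete, a single extracted relation can be pictured roughly as the following structure (our own illustrative sketch; the field names are not part of the method itself):

        from dataclasses import dataclass, field
        from typing import List, Optional

        @dataclass
        class Relation:
            predicate: str                      # the verb, e.g. "match"
            subject: Optional[str] = None       # who/what performs the action
            obj: Optional[str] = None           # what the action is applied to
            circumstance: Optional[str] = None  # condition/circumstance of the action
            descriptions: List[str] = field(default_factory=list)  # e.g. "good"
            components: List[str] = field(default_factory=list)    # e.g. "plasma"

        # "All the holes match with the TV, the washers do not fall out"
        relations = [
            Relation(predicate="match", subject="holes", obj="TV"),
            Relation(predicate="do not fall out", subject="washers"),
        ]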

    Now we will try to reduce the relation extraction problem to a sequence labeling problem, whose solution we discussed earlier.
    All
    holes -> subject
    with
    the TV -> object
    match -> predicate
    ,
    washers -> object
    not -> predicate
    fall out -> predicate

    We place the corresponding category next to each word and mark up a training sample of sufficient size in this way. Next, we can train any classifier that can work with sequences, for example a CRF, after which, feeding it a new sentence, we get a category prediction for each word. For our experiments we, of course, used our API, which anyone can access for free by registering on our site. We already described how to use it in detail here, so we will not repeat that, so as not to lose the main idea.
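    For readers who want to reproduce this step without our API, here is a minimal sketch of the same idea using the open-source sklearn-crfsuite library (an illustration on our part, not the code we actually ran; the features and the toy training example are made up):

        import sklearn_crfsuite

        def word_features(sent, i):
            # Very basic per-token features; a real system would add many more.
            word = sent[i]
            return {
                "word.lower": word.lower(),
                "word.isdigit": word.isdigit(),
                "prev.word": sent[i - 1].lower() if i > 0 else "<BOS>",
                "next.word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
            }

        # Toy training data: one tokenized sentence with a role label per token
        # ("O" means no role). The labels follow the scheme described above.
        train_sents = [
            (["All", "holes", "with", "the", "TV", "match", ",",
              "washers", "not", "fall", "out"],
             ["O", "subject", "O", "O", "object", "predicate", "O",
              "object", "predicate", "predicate", "predicate"]),
        ]

        X_train = [[word_features(toks, i) for i in range(len(toks))]
                   for toks, _ in train_sents]
        y_train = [labels for _, labels in train_sents]

        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit(X_train, y_train)

        # Predict roles for a new sentence, one label per word.
        new_sent = ["Changed", "awl", "for", "soap"]
        print(crf.predict([[word_features(new_sent, i)
                            for i in range(len(new_sent))]]))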

    We manually annotated about 100 sentences, which is actually a very small sample for such a task. Then we fed several new sentences to the model, and here is what came out:
    First sentence:
    Changed -> predicate
    awl -> object
    for
    soap -> object
    for
    such
    money -> object
    ,
    bought -> predicate
    on the
    name -> object

    Second sentence:
    In
    the set -> object
    besides
    the usual
    knife -> object
    there is -> predicate
    for
    dotted -> description
    notches -> object

    At this stage we noticed that the link between a verb and the arguments belonging to it has been lost (in fact, we knew this from the start, but kept quiet about it to simplify the presentation).

    There are various ways to solve this problem. In highly specialized systems, the type of an argument can itself tell which verb it belongs to:

    The train arrives at the station at 16-00, and leaves at 15-20.

    It is clear that here we can immediately label 16-00 as arrival_time and 15-20 as departure_time, while the verbs will also have types: arrives will be of the type “arrival” and leaves of the type “departure”. Thus, the question of correctly matching arguments to verbs is handed over to the sequence labeling system, and whether it copes with it or not will depend on the algorithm used.
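    For illustration, with such a specialized scheme the labeled output for the train example might look like this (hypothetical labels of our own; grouping the arguments back into relations then becomes trivial):

        # Hypothetical typed labels: each argument label already names the
        # relation it belongs to, so no separate attachment step is needed.
        labeled = [
            ("The", "O"), ("train", "O"), ("arrives", "arrival"),
            ("at", "O"), ("the", "O"), ("station", "O"),
            ("at", "O"), ("16-00", "arrival_time"), (",", "O"),
            ("and", "O"), ("leaves", "departure"), ("at", "O"),
            ("15-20", "departure_time"),
        ]

        # Group the arguments by the relation named in their label.
        relations = {}
        for word, tag in labeled:
            if tag == "O":
                continue
            rel = tag.split("_")[0]          # "arrival_time" -> "arrival"
            relations.setdefault(rel, {})[tag] = word
        print(relations)
        # {'arrival': {'arrival': 'arrives', 'arrival_time': '16-00'},
        #  'departure': {'departure': 'leaves', 'departure_time': '15-20'}}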

    This approach is well suited for parsing commands (“wake me up tomorrow at 10 am” => wake_up(me, tomorrow at 10 am), “order pizza for 10 people” => order(pizza, 10 people), and so on).

    In our case, we could have defined the argument types more precisely. Say, in the phrase “all the holes match with the TV”, we would have the relation type “match” and two arguments, “what_matches” and “what_it_matches_with”. This works fine when the number of relations is limited and strictly defined.

    We, however, chose a more general scheme from the start, with rather vague argument types, in the hope that they would fit any verb. The price for this is a second phase of analysis: determining which arguments correspond to which verb.

    Since we are building a simple fact extraction method, we assume that every argument belongs to the verb closest to it. This is not always the case, but it is often true. The same heuristic can be used to match objects with their descriptions.

    Accepting this simplification, we wrote a program that first finds all the detected verbs and then attaches the corresponding arguments to them, measuring the distance to the verb in words (a rough sketch of this step is shown after the results below). Using this program, we extracted the following relations from the sentences above:

    First sentence:

    predicate => changed
    object1 => awl
    object2 => soap

    predicate => bought
    object => name

    Second sentence:

    predicate => there is
    object1 => notch
    description => dotted
    object2 => knife
    description => ordinary
    object3 => set

    The result is quite interesting, considering that all of the work, including the manual annotation of the training sample, took us about 4 hours. To improve quality, one can, in a second stage of analysis, collect all the extracted facts together and try to discard incorrectly identified relations based on an analysis of the results.
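    The sketch below shows roughly how the attachment step mentioned above can look, assuming the sequence labeler returns (word, tag) pairs. It is an illustration of the nearest-verb heuristic, not the program we actually wrote; details such as tie-breaking and merging multi-word predicates are glossed over:

        def attach_arguments(tagged_sentence):
            # tagged_sentence: list of (word, tag) pairs from the sequence labeler.
            predicates = [i for i, (_, tag) in enumerate(tagged_sentence)
                          if tag == "predicate"]
            if not predicates:
                return []
            relations = {i: {"predicate": tagged_sentence[i][0], "args": []}
                         for i in predicates}
            for i, (word, tag) in enumerate(tagged_sentence):
                if tag in ("predicate", "O"):
                    continue
                # Attach the argument to the predicate closest to it, in words.
                nearest = min(predicates, key=lambda p: abs(p - i))
                relations[nearest]["args"].append((tag, word))
            return list(relations.values())

        tagged = [("Changed", "predicate"), ("awl", "object"), ("for", "O"),
                  ("soap", "object"), ("for", "O"), ("such", "O"),
                  ("money", "object"), (",", "O"), ("bought", "predicate"),
                  ("on", "O"), ("the", "O"), ("name", "object")]
        print(attach_arguments(tagged))
        # [{'predicate': 'Changed', 'args': [('object', 'awl'), ('object', 'soap')]},
        #  {'predicate': 'bought', 'args': [('object', 'money'), ('object', 'name')]}]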

    In general, as we can see, the task of extracting relations from texts can be solved in different ways. We have covered only a few, trying to focus on readily available methods, and, as you can see, building such an analyzer today is not that difficult.
