Extracting entities from text using Stanford NLP from scratch
This article is intended for those who have never worked with Stanford nlp and are faced with the need to study it and apply it as soon as possible.
This software is quite common; in particular, our company, BaltInfoCom, uses it.
First, you need to understand one simple thing: Stanford NLP works by annotating words. That is, one or more annotations are attached to each word, for example POS (part of speech), NER (named entity), and so on.
The first thing a newcomer sees in the "quick start" section of the Stanford NLP website is the following snippet:
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

// configure the pipeline with the annotators we need
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner,parse,depparse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// create a document object
CoreDocument document = new CoreDocument(text);
// annotate the document
pipeline.annotate(document);
Here StanfordCoreNLP is the pipeline, to which our text, wrapped in a CoreDocument object, is fed. StanfordCoreNLP is the most important and most frequently used object in the whole framework: all the main work goes through it.
First we set the properties of StanfordCoreNLP and specify which annotators we need. All possible combinations of these parameters can be found on the official website at this link.
- tokenize - splitting the text into tokens
- ssplit - splitting the token stream into sentences
- pos - determining the part of speech of each token
- lemma - adding to each word its base (dictionary) form
- ner - recognizing named entities, such as "Organization", "Person", etc.
- regexner - recognizing named entities using regular expressions
- parse - building a constituency parse tree for each sentence
- depparse - parsing the syntactic dependencies between words in a sentence
- coref - finding mentions of the same entity in the text, for example "Mary" and "she"
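To make the list above concrete, here is a minimal sketch of how the annotations can be read back from the document after the pipeline runs. The sample sentence is my own; the annotator set is a subset of the one shown earlier.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class AnnotationDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument document = new CoreDocument("Mary works at BaltInfoCom in Saint Petersburg.");
        pipeline.annotate(document);
        // each token now carries the POS, lemma, and NER annotations
        for (CoreLabel token : document.tokens()) {
            System.out.printf("%s\t%s\t%s\t%s%n",
                    token.word(), token.tag(), token.lemma(), token.ner());
        }
    }
}
```

Each annotator "hangs" its result on the token, which is why later annotators (such as ner) can rely on the output of earlier ones (such as pos).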
Here is an example of how the annotators (parse and depparse) work together:

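As a sketch of how parse and depparse can be used together in code (the sentence and class name are my own, assuming the CoreDocument/CoreSentence wrapper API):

```java
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.trees.Tree;
import java.util.Properties;

public class ParseDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,parse,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument document = new CoreDocument("Mary bought a book.");
        pipeline.annotate(document);
        CoreSentence sentence = document.sentences().get(0);
        // constituency tree produced by the parse annotator
        Tree tree = sentence.constituencyParse();
        tree.pennPrint();
        // dependency graph produced by the depparse annotator
        SemanticGraph deps = sentence.dependencyParse();
        System.out.println(deps);
    }
}
```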
If the annotations on tokens are not clear to you, you can find their meanings on these sites: the meanings of the dependency relations in sentences, and the meanings of the part-of-speech tags.
For each of these parameters, you can find additional flags for finer tuning here, in the "Annotators" section.
These properties are used if you want the built-in Stanford NLP models, but you can also configure annotators manually using the addAnnotator(Annotator ...) method, or by setting additional properties before creating the StanfordCoreNLP object.
Now, how to extract named entities from text. For this, Stanford NLP has three built-in classes based on regular expressions and one class that labels tokens using a trained model.
Classes based on regular expressions:
- TokensRegexAnnotator - an annotator that works according to the SequenceMatchRules rules.
Consider an example of a mapping for it, built on these rules:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
$EMAIL = "/.*([A-z0-9А-я]+?)(@)([A-z0-9А-я]+?).*/"
{ ruleType: "tokens", pattern: (([]) ($EMAIL)), action: (Annotate($0, ner, "MAIL")), priority: 0 }
In the first line, we specify what type of tag this template will fill in.
In the second, we create a variable, which, by the rules, must begin with the "$" character and stand at the beginning of a line.
After that we create a block in which we set the rule type, then a pattern for matching (in our case we say that we need "[]", i.e. any token, followed by our variable "$EMAIL"), and then the action, in our case annotating the token.
Note that in the example "[]" and "$EMAIL" are deliberately enclosed in parentheses, because $0 indicates which capture group of the matched pattern we want to select, where a capture group means a group enclosed in parentheses. If you specify $0, then in the phrase "mail sobaka@mail.ru" all tokens will be annotated as "MAIL". If you specify $1 (the first capture group), only the word "mail" will be annotated; if $2, only "sobaka@mail.ru".
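To wire such a rules file into the pipeline, you can use the tokensregex annotator. A minimal sketch, assuming the rules above are saved in a hypothetical file named email-rules.txt:

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class TokensRegexDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        // "email-rules.txt" is a hypothetical path to the rules file shown above
        props.setProperty("tokensregex.rules", "email-rules.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument doc = new CoreDocument("mail sobaka@mail.ru");
        pipeline.annotate(doc);
        // tokens matched by the rule should now carry the "MAIL" NER tag
        for (CoreLabel token : doc.tokens()) {
            System.out.println(token.word() + " -> " + token.ner());
        }
    }
}
```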
For situations where the same token can be matched by two different rules, you can set the priority of one rule relative to the other. For example, for the phrase "House $25" there may be two conflicting rules: according to one, the number 25 is a house number; according to the other, a price.
- RegexNERAnnotator - this annotator works using the RegexNERSequenceClassifier classifier.
The mapping for it looks as follows:
regex1	TYPE	overwritableType1,Type2...	priority
Here regex1 is a regular expression in the TokenSequencePattern format.
TYPE is the name of the named entity.
overwritableType1, Type2 ... are the types that this rule is allowed to overwrite in disputed cases.
priority is the priority for resolving the disputed situations described above.
Please note that in this mapping all columns must be separated by tabs.
- TokensRegexNERAnnotator
This annotator differs from the previous one in that it uses the TokensRegex library for regular expressions (the same one as the first annotator), which allows more flexible matching rules; it can also write values to tags other than the NER tag.
The mapping for it follows the same rules as for RegexNERAnnotator.
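A sketch of plugging such a tab-separated mapping into the pipeline via the regexner annotator (the file name, its contents, and the sample sentence are my own illustrations):

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class RegexNerDemo {
    public static void main(String[] args) {
        // "company-mapping.txt" is a hypothetical tab-separated file, e.g. one line:
        // BaltInfoCom<TAB>ORGANIZATION<TAB><TAB>1.0   (regex, TYPE, overwritable types, priority)
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner");
        props.setProperty("regexner.mapping", "company-mapping.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument doc = new CoreDocument("People working in BaltInfoCom.");
        pipeline.annotate(doc);
        for (CoreLabel token : doc.tokens()) {
            System.out.println(token.word() + " -> " + token.ner());
        }
    }
}
```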
Marking text through a model using NERClassifierCombiner
To use this class, you must first have, or train, your own model.
How to do this can be found here;
After you have trained the model, all that remains is to create a NERClassifierCombiner, pass it the model path, and call the classify method.
import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.List;

// serialized_model is the String path to your trained model file
NERClassifierCombiner classifier = new NERClassifierCombiner(false, false, serialized_model);
String text = "Some lucky people working in BaltInfoCom Org.";
List<List<CoreLabel>> out = classifier.classify(text);
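The classify method returns one inner list per sentence. A sketch of reading the result, assuming (as is usual for these classifiers) that the assigned label is stored in AnswerAnnotation:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.List;

// iterate over the classified sentences and print each token with its label
for (List<CoreLabel> sentence : out) {
    for (CoreLabel token : sentence) {
        System.out.println(token.word() + " -> "
                + token.get(CoreAnnotations.AnswerAnnotation.class));
    }
}
```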
A complete list of annotators can be found here .
In addition to the above, if you need to use Stanford NLP for the Russian language, I suggest looking here. There are models there for part-of-speech tagging (pos-tagger) and for dependency parsing (dependency parser).
The taggers presented there:
- russian-ud-pos.tagger - just a tagger,
- russian-ud-mfmini.tagger - with a basic list of morphological features,
- russian-ud-mf.tagger - with a complete list of morphological features; an example of its use can be found here.
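A minimal sketch of loading one of these taggers directly with MaxentTagger, assuming the model file has been downloaded to the working directory (the sample sentence is my own):

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class RussianTaggerDemo {
    public static void main(String[] args) {
        // "russian-ud-pos.tagger" is assumed to be present locally
        MaxentTagger tagger = new MaxentTagger("russian-ud-pos.tagger");
        // tagString returns the input with a POS tag appended to each token
        System.out.println(tagger.tagString("Мама мыла раму"));
    }
}
```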