Regexp and Python: extracting tokens from text

  • Tutorial
Analysis of logs and configuration files is a task that arises frequently and has been described many times. In this article I will show how to implement its classic solution in Python: using regular expressions and named groups. Where possible, I will explain why a particular approach is used, and point out the pitfalls and ways around them.


Why parse text and what are tokens


Text files that interest our programs usually contain more than one unit of information. So that a program can separate one piece of information from another, we define file formats, that is, agreements on how the text inside a file is laid out. The simplest format puts each unit of information on its own line. Such a file needs almost no extra processing: just read it with the means of your programming language and break it into lines, which most languages can do in one or two statements. Unfortunately, most files that need processing have a slightly more complex format. A classic settings file, for example, contains lines of the form name=value. In the general case such a format is also fairly easy to parse by reading the file line by line and finding '=' in each line: whatever stands to the left of it is the field name, whatever stands to the right is the value. This logic works until we have to parse a file with multi-line field values, or with values that themselves contain the "=" character. Attempts to handle such a file quickly lead to numerous checks, loops, and other complications in the code. That is why, for text files whose structure is more complex than a list of lines, splitting the text into tokens with regular expressions has long been used with success. A "token" usually means a small piece of text that sits in a particular place in that text and carries a particular meaning. For example, in the following fragment of a configuration file:

name=Vasya

Three tokens can be distinguished: "name" as the name of the field, "=" as the separator, and "Vasya" as the value of the field. Strictly speaking, what I call a token in this article better fits the definition of a lexeme. The difference between the two is that a lexeme is a piece of text of a certain format taken without regard to its position relative to other pieces of text, while a token also carries that positional meaning. Complex parsers, such as those used in compilers, first break the text into lexemes and then process the list of lexemes with a large and branchy state machine, which assembles tokens out of those lexemes.
Fortunately, Python has a very good library for working with regular expressions, which lets you solve most text-processing tasks in one pass, without an intermediate search for lexemes and their subsequent conversion into tokens.
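To make the distinction concrete, here is a minimal sketch, not taken from the article, of how a token could be represented in Python; the field names are purely illustrative:

import collections

# A lexeme is just the text; a token (in this article's sense) also records
# its kind and its position in the source text.
Token = collections.namedtuple("Token", ["kind", "text", "start", "end"])

print(Token(kind="value", text="Vasya", start=5, end=10))  # offsets in "name=Vasya"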

What are regular expressions?


Regular expressions are... well... To put it very briefly, they are a small programming language designed for searching in text. A very, very simple programming language: it has practically no conditionals, no loops and no functions; there is only a single expression that describes what text we want to find. But that single expression can get very long :). To use regular expressions successfully, in general and in Python in particular, you need to know a few things. First, every self-respecting regular expression library uses its own syntax for regular expressions. The syntaxes are broadly similar, but the details and the extra features can differ a lot, so before using regular expressions in Python you should get acquainted with their syntax in the official documentation.
Second, regular expressions do not separate the syntax of the language itself from user data. That is, if we want to find the word "Vasya", the regular expression that searches for it looks exactly like "Vasya". There is no language syntax in that expression at all, only the string we want to find. But if we want to find the word "Vasya" followed by a comma or a semicolon, the regular expression acquires the necessary and important details: "Vasya,|Vasya;". As you can see, the language's "logical or" construct, written as a vertical bar, has appeared here, and the strings we specified are in no way separated from it. This leads to an important and unpleasant consequence: if we want to put a character into the search string that is also part of the language syntax, we have to write "\" in front of it. So the regular expression that searches for the word "Vasya" followed by a period or a question mark looks like this: "Vasya\.|Vasya\?". Both the period and the question mark are part of the regular expression syntax :(.
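By the way, if the word being searched for comes from outside and may contain such special characters, the re module's helper re.escape() can do the backslashing for you; a small sketch:

import re

word = "Vasya?"                  # contains a regex metacharacter
pattern = re.escape(word)        # becomes "Vasya\?"
print(re.search(pattern, "Is that Vasya?").group())   # Vasya?
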
Third, regular expressions are greedy by default. Unless you say otherwise, the string of maximum length that satisfies the regular expression will be found. For example, if we want to find strings of the form "name=value" in a text and write the regular expression ".+=.+", then on the text "a=b" it works correctly and returns "a=b". But on the text "a=b, c=d" it returns "a=b, c=d", that is, the whole text. You must always keep this property of regular expressions in mind and write them in such a way that the library has no temptation to return half of "War and Peace" as a search result. For example, the previous regular expression can be improved a little: "[^=]+=[^=]+". This version takes into account that the text before and after the "=" character must not itself contain "=".
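A quick demonstration of this greediness; the lazy "+?" quantifier in the last line is a standard re feature that the article does not otherwise use:

import re

text = "a=b, c=d"
print(re.search(r".+=.+", text).group())        # a=b, c=d  (greedy: the whole text)
print(re.search(r"[^=]+=[^=]+", text).group())  # a=b, c    (stops before the next "=")
print(re.search(r".+?=.+?", text).group())      # a=b       (lazy quantifiers)
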

We are looking for a token in the text


The Python regular expression library is called "re". Its main function is, in essence, one: search(). Pass the regular expression as the first argument and the text to search in as the second, and you get the search result back. Please note that it is better to use the "r" prefix for the string with the regular expression, so that the "\" characters are not converted into string escape sequences. A search example:

import re

match = re.search(r"Vasya\.|Vasya\?", "Vasya?")
print(match.group())

As you can see from the example, the search() method returns an object of type 'search result' (a match object), which has several methods and fields that give you the found text, its position in the original string, and other necessary and useful properties. Consider a more lifelike example: the classic configuration file, consisting of section names in curly brackets, field names and field values. The regular expression to search for section names will look like this:

import re

txt = '''
{number section}
num=1
{text section}
txt="2"
'''

match = re.search(r"{[^}]+}", txt)
print(match.group())

The result of this code will be the string "{number section}" - the section name was successfully found.
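Besides the found text, the 'search result' object also reports where the fragment sits in the source; a small continuation of the example above:

print(match.span())                    # (1, 17): offsets of the fragment in txt
print(txt[match.start():match.end()])  # {number section}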

We are looking for all instances of the token in the text


As you can see from the previous example, a plain call to re.search() finds only the first token in the text. To find all instances of a token, the re library offers several ways. The most correct one, in my opinion, is the finditer() method, which returns an iterator over objects of type 'search result'. By getting these mysterious objects instead of plain strings (which, for example, findall() returns), we can not only see what text was found but also learn exactly where it was found: for that, a 'search result' object has a specially trained span() method, which returns the exact position of the found fragment in the source text. The modified code that finds all instances of the token using finditer() looks like this:

result = re.finditer(r"{[^}\n]+}", txt)
for match in result:
    print(match.group())
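
For comparison, findall() with the same pattern returns bare strings, so the position information is gone; this is why finditer() is usually the better choice:

print(re.findall(r"{[^}\n]+}", txt))   # ['{number section}', '{text section}']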

We are looking for different tokens in the text


Unfortunately, searching for a single token is interesting, of course, but of little practical use. Usually a text, even one as simple as a configuration file, contains many tokens of interest to us. In the configuration file example those are at least the section names, the field names and the field values. To search for several different tokens, the regular expression language uses groups. Groups are fragments of a regular expression enclosed in parentheses; the parts of the text corresponding to these fragments are returned as separate results. Thus, a regular expression that can find sections, fields, and values looks like this:

result = re.finditer(r"({[^}\n]+})|(?:([^=\n]+)=([^\n]+))", txt)
for match in result:
    print(match.groups())

Please note that this code differs noticeably from the previous one. First, three groups are distinguished in the regular expression: "({[^}\n]+})" matches the heading in curly braces,
"([^=\n]+)" before the '=' sign matches the field name, and "([^\n]+)" after the '=' sign matches the field value. There is also the odd-looking group "(?:...)", which wraps the name and value groups. This is a special non-capturing group meant for use with the logical operator '|': it lets you combine several groups under one '|' without side effects. Second, the groups() method is used instead of group() to print the results. This is not accidental: the Python regular expression library has its own idea of what a "search result" is. It shows in the fact that a regular expression with two groups, "([^=\n]+)=([^\n]+)", applied to the text "a=b", returns ONE object of type "result", which consists of several GROUPS.

We determine what exactly we found


If you run the previous example, the screen will show approximately the following result:

('{number section}', None, None)
(None, 'num', '1')
('{text section}', None, None)
(None, 'txt', '"2"')

As you can see, for each result the groups() method returns a magic list of three elements, each of which is either None (empty) or found text. If you pore over the documentation thoughtfully, you can figure out that the library found three groups in our expression and now, for each result, shows which groups are present in it. We can see that the first group corresponds to the section name, the second to the field name, and the third to the field value. So the first result, "{number section}", is a section name; the second result, "num=1", is a field name plus a field value, and so on. As you can see, this is rather confusing and inconvenient: in the general case it is hard to determine WHAT EXACTLY we found.
To answer this important question, groups can be named. For this, the regular expression language has a special syntax: "(?P<group_name>expression)". If we modify our code slightly and give the three groups names, everything becomes much more convenient:

import re

txt = '''
{number section}
num=1
{text section}
txt="2"
'''

regex = re.compile(r"(?P<section>{[^}\n]+})|(?:(?P<name>[^=\n]+)=(?P<value>[^\n]+))",
                   re.M | re.S | re.U)
result = regex.finditer(txt)
group_name_by_index = dict((v, k) for k, v in regex.groupindex.items())
print(group_name_by_index)
for match in result:
    for group_index, group in enumerate(match.groups()):
        if group:
            print("text: %s" % group)
            print("group: %s" % group_name_by_index[group_index + 1])
            print("position: %d,%d" % match.span(group_index + 1))

Pay attention to a number of cosmetic changes. Before the search, the re.compile() function is used; it returns a so-called "compiled regular expression". Besides speed and convenience, it has one remarkable property: its groupindex attribute is a dictionary containing the names of all the groups in the expression and their indices. Unfortunately, the dictionary is inverted for some reason: the key in it is not the index but the name of the group. The scary expression with dict() corrects this annoying misunderstanding, and the group_name_by_index dictionary can then be used to get the name of a group by its number. The compilation also uses the flags re.M (correct matching of the beginning of line "^" and end of line "$" in multi-line text), re.S ("." matches absolutely everything, including "\n") and re.U (correct matching in Unicode text). As a result, analysing what was found takes two loops: first we iterate over the search results, and then, for each result, over the groups contained in it. The result is an accurate and complete list of tokens, with their types and their positions in the text. Such a list can be used for text processing, syntax highlighting, error detection: in general, a necessary and useful thing.
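As a side note, the match object's groupdict() method (a standard part of the re module) maps group names straight to the matched text, which often makes the index-to-name dictionary above unnecessary; a sketch using the same compiled expression:

for match in regex.finditer(txt):
    # keep only the groups that actually matched in this result
    found = {name: text for name, text in match.groupdict().items() if text is not None}
    print(found)   # e.g. {'section': '{number section}'} or {'name': 'num', 'value': '1'}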

Conclusion


The demonstrated way of finding tokens in text is neither the best nor the most correct one: even within Python and its standard regular expression library there are at least a dozen alternative approaches that are in no way inferior to it. But I hope the examples and explanations above will help someone get up to speed quickly when the need arises, saving time on googling and on finding out why things do not work quite the way we would like. Good luck to everyone; I look forward to your comments.

