Regular expressions, a guide for beginners. Part 1

Original author: A.M. Kuchling
Regular expressions (REs) are essentially a tiny programming language embedded in Python and made available through the re module. Using this language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, e-mail addresses, TeX commands, or anything you like. With REs you can ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given regular expression, and to write it so that it produces bytecode that runs faster. Optimization is not covered in this document, because it requires a good understanding of the engine's internals.

The regular expression language is relatively small and restricted, so not every string processing task can be done with regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing ordinary Python code; it will be slower than an elaborate regular expression, but it will probably be easier to understand.

Simple patterns


We'll start by exploring the simplest regular expressions. Since regular expressions are used to work with strings, we will start with the most common task - matching characters.

For a detailed explanation of the technical side of regular expressions (deterministic and non-deterministic finite state machines), you can refer to almost any textbook on writing compilers.

Character Matching

Most letters and characters simply match themselves. For example, the regular expression test will match the string test exactly. (You can enable a case-insensitive mode that lets this regular expression match Test or TEST as well; more on that later.)

There are exceptions to this rule; some characters are special metacharacters and do not match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other parts of the regular expression by repeating them or changing their meaning. Much of this tutorial is devoted to discussing the various metacharacters and what they do.

Here is a complete list of metacharacters; their meanings will be discussed in the rest of this HOWTO.

. ^ $ * + ? { [ ] \ | ( )

The first metacharacters we will look at are [ and ]. They are used to specify a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be given as its first and last characters separated by '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, the regular expression would be [a-z].

Metacharacters are not active inside classes (apart from the few, such as '^', '-', and the backslash, whose class-specific roles are described in this section). For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'. The character '$' is usually a metacharacter (as the list above shows), but inside a character class it is stripped of its special nature.

You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'.
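
The character-class rules described so far can be checked with a short sketch. It assumes Python 3 syntax, uses re.findall() (covered later in this guide) only as a convenient way to list what matched, and the sample strings are made up for illustration.

```python
import re

# [abc] and [a-c] describe the same set of characters.
same_set = re.findall(r'[abc]', 'aXbYcZ') == re.findall(r'[a-c]', 'aXbYcZ')

# Inside a class, '$' is an ordinary character.
dollar_hits = re.findall(r'[akm$]', 'make $5')

# A leading '^' complements the class: everything except '5'.
not_five = re.findall(r'[^5]', '1525')

print(same_set)      # True
print(dollar_hits)   # ['m', 'a', 'k', '$']
print(not_five)      # ['1', '2']
```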

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It is also used to escape metacharacters so that you can still match them in patterns; for example, if you need to match a [ or a \, you can precede it with a backslash to remove its special meaning: \[ or \\.
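
As a small illustration of escaping, the sketch below looks for a literal '[' and a literal backslash. It assumes Python 3 and writes the patterns as raw string literals (the r prefix, which this guide explains in its backslash section); the sample text is invented.

```python
import re

text = r'item[0] lives in C:\temp'

# r'\[' matches a literal '[' and r'\\' matches a literal backslash.
brackets = re.findall(r'\[', text)
backslashes = re.findall(r'\\', text)

print(len(brackets))     # 1
print(len(backslashes))  # 1
```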

Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that is not whitespace (spaces, tabs, and so on). The following predefined sequences are a subset of those available. For a complete list of sequences and expanded class definitions for Unicode strings, see the last part of Regular Expression Syntax.

\d
Matches any decimal digit; equivalent to the class [0-9].
\D
Matches any non-digit character; equivalent to the class [^0-9].
\s
Matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character or underscore; equivalent to the class [a-zA-Z0-9_].
\W
Matches any character that \w does not; equivalent to the class [^a-zA-Z0-9_].

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.
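
The predefined sequences can be tried out as follows. This is only a sketch under a few assumptions: Python 3 syntax, an invented ASCII-only sample string (so the Unicode-aware defaults of Python 3 give the same result as the classes listed above), and re.findall() used simply to list the matches.

```python
import re

text = 'Room 42, floor 3. Call 555-0100'

digits = re.findall(r'\d', text)          # individual digits
words = re.findall(r'\w+', text)          # runs of letters, digits, underscore
separators = re.findall(r'[\s,.]', text)  # whitespace, comma or dot

print(''.join(digits))  # 4235550100
print(words)            # ['Room', '42', 'floor', '3', 'Call', '555', '0100']
print(len(separators))  # 7
```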

The final metacharacter in this section is '.'. It matches any character except a newline, and there is an alternate mode (re.DOTALL) in which it matches even a newline. '.' is often used where you want to match “any character”.
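
The difference between the default behaviour of '.' and the re.DOTALL mode can be seen in a minimal sketch like this one (Python 3 syntax assumed; the two-line sample string is made up):

```python
import re

text = 'ab\ncd'

default_dot = re.findall(r'.', text)            # the newline is skipped
dotall_dot = re.findall(r'.', text, re.DOTALL)  # the newline is included

print(default_dot)  # ['a', 'b', 'c', 'd']
print(dotall_dot)   # ['a', 'b', '\n', 'c', 'd']
```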

Repeating things

Being able to match varying sets of characters is the first thing regular expressions can do that is not always possible with the string methods. If that were their only additional capability, however, they would not be very interesting. The other capability is that you can specify how many times a portion of the regular expression must be repeated.

The first repetition metacharacter we will look at is *. It specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ca*t will match ct (0 a characters), cat (1 a), caaat (3 a characters), and so on. The engine has various internal limits stemming from the size of C's int type, which prevent it from matching more than 2 billion 'a' characters. (I hope you do not need that many.)
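
A quick way to confirm this behaviour is sketched below, assuming Python 3 syntax. The string 'cbt' is an extra, made-up sample added to show a non-match, and match() (described later in this guide) is used to test each string from its start.

```python
import re

p = re.compile(r'ca*t')

# Map each candidate string to whether the pattern matches it.
results = {s: bool(p.match(s)) for s in ['ct', 'cat', 'caaat', 'cbt']}

print(results['ct'], results['cat'], results['caaat'])  # True True True
print(results['cbt'])                                   # False
```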

Repetitions such as * are greedy; when repeating, the matching engine will try to repeat as many times as possible. If later portions of the pattern do not match, the engine will then back up and try again with fewer repetitions.

A step-by-step example will make this clearer. Consider the expression a[bcd]*b. It matches the letter 'a', zero or more letters from the class [bcd], and finally ends with a 'b'. Now imagine matching this regular expression against the string abcbd. The match proceeds in these stages:

1. a - The 'a' in the regular expression matches.
2. abcbd - The engine matches [bcd]*, going as far as it can, which is to the end of the string (every remaining character belongs to the class in the brackets).
3. Failure - The engine tries to match the last element of the regular expression, the letter b, but the current position is at the end of the string, where there are no characters left, so it fails.
4. abcb - Back up, so that [bcd]* matches one character fewer.
5. Failure - Try b again, but the character at the current position is the final 'd'.
6. abc - Back up again, so that [bcd]* matches only bc.
7. abcb - Try b again. This time the character at the current position is 'b', so the match succeeds.

The end of the regular expression has now been reached, and it has matched abcb. This demonstrates how the matching engine goes as far as it can at first, and if no match is found it backs up progressively and retries the rest of the regular expression again and again. It will back up until it has tried zero matches for [bcd]*, and if that subsequently fails, the engine will conclude that the string does not match the regular expression at all.
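
The outcome of the walk-through can be reproduced with a brief sketch (Python 3 syntax assumed). The engine's intermediate steps are not visible from Python, so only the final result is shown; the second string, 'acd', is an invented example in which backtracking runs out of options.

```python
import re

m = re.match(r'a[bcd]*b', 'abcbd')
print(m.group())  # abcb
print(m.span())   # (0, 4)

# No 'b' can end the match here, so every backtracking attempt fails.
no_match = re.match(r'a[bcd]*b', 'acd')
print(no_match)   # None
```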

Another repetition metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +: * matches zero or more times, so whatever is being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match cat (1 a) and caaat (3 a characters), but will not match ct.

There are two more repetition qualifiers. The question mark, ?, matches either once or zero times. For example, home-?brew matches both homebrew and home-brew.

The most complicated repetition qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions and at most n. For example, a/{1,3}b will match a/b, a//b, and a///b. It will not match ab, which has no slashes, or a////b, which has four.
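
The bounds of {m,n} can be checked directly against the strings mentioned above. This is a minimal sketch assuming Python 3 syntax; match() is used to test each string from its beginning.

```python
import re

p = re.compile(r'a/{1,3}b')

candidates = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']
results = {s: bool(p.match(s)) for s in candidates}

print([s for s in candidates if results[s]])      # ['a/b', 'a//b', 'a///b']
print([s for s in candidates if not results[s]])  # ['ab', 'a////b']
```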

You can omit either m or n; in that case a reasonable value is assumed for the missing one. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity, although, as mentioned above, in practice that bound is limited.

Readers may have noticed that the three other qualifiers can all be expressed using this notation: {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?.
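
One way to convince yourself of these equivalences is to compare each pair of patterns on a handful of strings. The sketch below assumes Python 3 syntax; the helper name matches() and the sample list are hypothetical, chosen only for this illustration.

```python
import re

samples = ['ct', 'cat', 'caat', 'caaat']

def matches(pattern):
    """Return the sample strings that the pattern matches from the start."""
    return [s for s in samples if re.match(pattern, s)]

pairs = [('ca*t', 'ca{0,}t'), ('ca+t', 'ca{1,}t'), ('ca?t', 'ca{0,1}t')]
equivalent = all(matches(short) == matches(braces) for short, braces in pairs)

print(equivalent)       # True
print(matches('ca+t'))  # ['cat', 'caat', 'caaat']
print(matches('ca?t'))  # ['ct', 'cat']
```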

Using Regular Expressions


Now that we have looked at some simple regular expressions, how do we actually use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile regular expressions into objects and then perform matches with them.

Regular expression compilation

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

>>> import re
>>> p = re.compile('ab*')
>>> print p
<_sre.SRE_Pattern object at 0x...>


re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations:

>>> p = re.compile('ab*', re.IGNORECASE)

The regular expression is passed to re.compile() as a string. Regular expressions are handled as strings because they are not part of the core Python language, and no special syntax was created for expressing them. (There are applications that do not need regular expressions at all, so there is no need to bloat the language specification by including them.) Instead, the re module is simply a wrapper around a C extension module, just like the socket or zlib modules.
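
The effect of the optional flags argument can be seen by compiling the same pattern twice. This is a small sketch assuming Python 3 syntax; the variable names and the sample string 'ABBB' are illustrative only.

```python
import re

plain = re.compile('ab*')
ignore_case = re.compile('ab*', re.IGNORECASE)

plain_result = plain.match('ABBB')              # None: 'A' is not 'a'
ignore_case_result = ignore_case.match('ABBB')  # a match object

print(plain_result)                # None
print(ignore_case_result.group())  # ABBB
```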

Passing regular expressions as a string allows Python to be simpler, but has one drawback, which is the topic of the next section.

The backslash plague
(or the backslash disaster :))



As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's use of the same character for the same purpose in string literals.

Let's say you want to write a regular expression that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, which gives \\section. The string that must be passed to re.compile() is therefore \\section. However, to express this as a Python string literal, both backslashes must be escaped again, resulting in "\\\\section".

In short, to match a literal backslash you have to write '\\\\' as the regular expression string, because the regular expression must be \\, and each backslash must be written as \\ inside a regular Python string literal.

The solution is to use Python's raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. For this reason, regular expressions are often written using raw strings.

Regular string      Raw string
'ab*'               r'ab*'
'\\\\section'       r'\\section'
'\\w+\\s+\\1'       r'\w+\s+\1'
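
That the two notations produce the same pattern can be verified directly. The following is a minimal sketch in Python 3 syntax; the LaTeX fragment used as the search target is invented.

```python
import re

regular = '\\\\section'   # regular literal: every backslash doubled twice
raw = r'\\section'        # raw literal: written exactly as the pattern reads

same = regular == raw
print(same)               # True
print(len(raw))           # 9 (two backslashes plus 'section')

# A regular literal holding the text \section{Intro}.
target = '\\section{Intro}'
m = re.search(raw, target)
print(m.group() == '\\section')  # True

# The r prefix also changes what "\n" means.
print(len(r'\n'), len('\n'))     # 2 1
```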


Matching


Once you have an object representing a compiled regular expression, what do you do with it? Pattern objects have several methods and attributes. Only the most significant ones are covered here; see the re documentation for a complete list.

Method / Attribute    Purpose
match()               Determine if the regular expression matches at the beginning of the string
search()              Scan through a string, looking for any location where the regular expression matches
findall()             Find all substrings where the regular expression matches, and return them as a list
finditer()            Find all substrings where the regular expression matches, and return them as an iterator


match() and search() return None if no match can be found. If the search is successful, a MatchObject instance is returned, containing information about the match: where it starts and ends, the substring it matched, and so on.

You can learn about this by experimenting interactively with the re module. You can also take a look at Tools/scripts/redemo.py, a demonstration program included with the Python distribution. It lets you enter regular expressions and strings, and displays whether the regular expression matches or fails. redemo.py can be quite useful when debugging a complicated regular expression. Phil Schwartz's Kodos is another interactive tool for developing and testing regular expression patterns.

This tutorial uses the standard Python interpreter for its examples:

>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 0x...>


Now you can try matching various strings against the regular expression [a-z]+. An empty string should not match at all, since + means “one or more repetitions”. match() should return None in this case, as we can see:

>>> p.match("")
>>> print p.match("")
None


Now let's try it on a string that should match, such as 'tempo'. In this case, match() will return a MatchObject, which you can store in a variable for later use:

>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 0x...>


Now you can query the MatchObject for information about the matching string. MatchObject instances also have several methods and attributes; the most important ones are:

Method / Attribute    Purpose
group()               Return the string matched by the regular expression
start()               Return the starting position of the match
end()                 Return the ending position of the match
span()                Return a tuple containing the (start, end) positions of the match

>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)


Since the match() method only checks whether the regular expression matches at the start of a string, start() will always return 0. The search() method, however, scans through the whole string, so the match may not start at zero in that case:

>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<_sre.SRE_Match object at 0x...>
>>> m.group()
'message'
>>> m.span()
(4, 11)


In real programs, the most common style is to store the MatchObject in a variable and then check whether it is None. This usually looks like the following:

p = re.compile(...)
m = p.match('string goes here')
if m:
    print 'Match found:', m.group()
else:
    print 'No match'


Two methods return all of the matches for a pattern. findall() returns a list of matching strings:

>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']


findall() has to create the entire list before it can be returned as the result. The finditer() method instead returns a sequence of MatchObject instances as an iterator.

>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x...>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)
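
Because finditer() yields match objects rather than plain strings, each result can report both its text and its position. A minimal sketch of that difference, assuming Python 3 syntax and reusing the full "drummers" sentence from the findall() example:

```python
import re

p = re.compile(r'\d+')
text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# Each item from finditer() is a match object with group() and span().
found = [(m.group(), m.span()) for m in p.finditer(text)]
print(found)  # [('12', (0, 2)), ('11', (22, 24)), ('10', (40, 42))]

# The matched strings alone are exactly what findall() returns.
same_strings = [g for g, _ in found] == p.findall(text)
print(same_strings)  # True
```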


Module Level Functions


You do not have to create a pattern object and call its methods; the re module also provides top-level functions called match(), search(), findall(), sub(), and so forth. These functions take the same arguments as the corresponding pattern methods, with the regular expression string added as the first argument, and still return either None or a MatchObject instance.

>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<_sre.SRE_Match object at 0x...>


Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same regular expression will be faster.
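
The correspondence between the two styles can be sketched as follows (Python 3 syntax assumed; the variable names are illustrative). The cache itself is an internal detail and is not inspected here; the example only shows that both routes give the same answer.

```python
import re

text = 'From amk Thu May 14 19:12:10 1998'

# Module-level call: pattern string first, then the string to examine.
via_module = re.search(r'\d+', text)

# Equivalent two-step form with an explicit pattern object.
via_object = re.compile(r'\d+').search(text)

print(via_module.group(), via_module.span())  # 14 (17, 19)
print(via_module.span() == via_object.span()) # True
print(re.findall(r'\d+', text))               # ['14', '19', '12', '10', '1998']
```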

Should you use these module-level functions, or should you get the pattern object and call its methods yourself? That depends on how often the regular expression will be used and on your personal coding style. If a regular expression is used at only one point in the code, the module functions are probably more convenient. If a program contains many regular expressions, or reuses the same ones in several places, it may be worthwhile to collect all the definitions in one place, in a section of code that compiles all the regular expressions ahead of time. As an example from the standard library, here is an extract from xmllib.py:

ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )


I generally prefer to work with compiled objects, even for one-time uses, but few people will be as much of a purist about this as I am.

Compilation flags


Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names: a long name such as IGNORECASE and a short, one-letter form such as I. Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
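
Combining flags might look like the sketch below, written in Python 3 syntax with an invented three-line sample string. It relies on the '^' metacharacter, which anchors a match at the start of a line and is only introduced in the second part of this guide.

```python
import re

text = 'spam\nSPAM\neggs'

# re.I ignores case; re.M lets '^' match at the start of every line.
both = re.compile(r'^spam', re.I | re.M)
only_ignorecase = re.compile(r'^spam', re.I)

print(both.findall(text))             # ['spam', 'SPAM']
print(only_ignorecase.findall(text))  # ['spam']

# The long names are aliases of the short ones.
aliases_agree = (re.I == re.IGNORECASE) and (re.M == re.MULTILINE)
print(aliases_agree)                  # True
```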

DOTALL, S
Makes the '.' special character match any character at all, including a newline; without this flag, '.' matches anything except a newline.

IGNORECASE, I
Performs case-insensitive matching. For example, [A-Z] will match lowercase letters too, and Spam will match Spam, spam, spAM, and so on.

LOCALE, L
Makes \w, \W, \b, and \B dependent on the current locale. For example, if you are processing French text, you would want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z]; it will not match 'é' or 'ç'. If your system is configured properly and a French locale is selected, 'é' will also be treated as a letter.

MULTILINE, M
(The metacharacters ^ and $ have not been described yet; they will be introduced a little later, at the beginning of the second part of this guide.)

Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, $ matches at the end of the string and at the end of each line, immediately preceding each newline.

UNICODE, U
Makes \w, \W, \b, \B, \d, \D, \s, and \S dependent on the Unicode character properties database.

VERBOSE, X
Enables verbose regular expressions, which can be organized more cleanly and understandably. When this flag is specified, whitespace within the regular expression string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you lay out and indent the regular expression more clearly. This flag also lets you put comments within a regular expression; a comment starts with '#' and is ignored by the engine.

Here is an example of a regular expression that becomes much easier to read with re.VERBOSE:

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)


Without the verbose setting, the regular expression would look like this:

charref = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")


In the above example, Python's automatic concatenation of string literals has been used to break up the regular expression into smaller pieces, but without the comments it is still more difficult to understand than the version using re.VERBOSE.
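
To check that the verbose and compact spellings really describe the same pattern, both can be run over a few strings. The sketch below assumes Python 3 syntax; the names verbose, compact, and samples, as well as the sample entity references, are made up for this comparison.

```python
import re

verbose = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

compact = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")

samples = ['&#65;', '&#x41;', '&#0101;', '&#;', '&amp;']
verbose_hits = [bool(verbose.match(s)) for s in samples]
compact_hits = [bool(compact.match(s)) for s in samples]

print(verbose_hits)                  # [True, True, True, False, False]
print(verbose_hits == compact_hits)  # True
```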

This is where we will end our overview for now. I suggest taking a short break before the second half, which covers the remaining metacharacters, the methods for splitting strings and for search and replace, and a large number of examples of regular expressions in use.

