
Parsing in Python: Pyparsing for Beginners
Parsing is the process of matching a sequence of words or characters against a so-called formal grammar. For example, for the line of code:
import matplotlib.pyplot as plt
the grammar is as follows: first comes the keyword import, then a module name or a chain of module names separated by dots, then the keyword as, and finally our own name for the imported module.
The goal of parsing might be, for example, to arrive at the following expression:
{ 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' }
This expression is a Python dictionary that has two keys: 'import' and 'as'. The value for the 'import' key is a list in which the names of the imported modules are listed in order.
Regular expressions are commonly used for parsing; Python provides them in the re module. If you have never worked with regular expressions, their appearance may be intimidating. For example, for the line 'import matplotlib.pyplot as plt' the regular expression looks like this:
r'^[ \t]*import +\D+\.\D+ +as \D+'
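To see this regular expression at work, here is a minimal sketch using the standard re module. The capture groups and the names pattern, match and result are additions for illustration; they are not part of the original expression.

```python
import re

# The pattern from the text, with two capture groups added (an assumption
# made for illustration) so that the matched pieces can be extracted.
pattern = re.compile(r'^[ \t]*import +(\D+\.\D+) +as (\D+)')

match = pattern.match('import matplotlib.pyplot as plt')

# Split the dotted chain into separate module names.
result = {'import': match.group(1).split('.'), 'as': match.group(2)}
print(result)
```

Note that \D matches any non-digit character, including spaces and dots, so the pattern is only as strict as this single example requires.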
Fortunately, there is a convenient and flexible parsing tool called Pyparsing. Its main advantage is that it makes the code more readable, and also allows for additional processing of the analyzed text.
In this article, we will install Pyparsing and build our first parser with it.
First install Pyparsing. If you are working on Linux, at a command prompt, type:
sudo pip install pyparsing
On Windows, open a command prompt as administrator, go to the directory that contains pip.exe (for example, C:\Python27\Scripts\), and then run:
pip install pyparsing
Another way is to go to the Pyparsing project page on SourceForge, download the Windows installer there and install Pyparsing like a regular program. Full information on all the ways to install Pyparsing can be found on the project page.
Let's move on to parsing. Let s be the following string:
s = 'import matplotlib.pyplot as plt'
As a result of parsing, we want to get a dictionary:
{ 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' }
First you need to import Pyparsing. Start, for example, Python IDLE and type:
from pyparsing import *
The asterisk * above means importing all names from pyparsing. This can pollute the namespace and lead to errors in the program. In our case * is used temporarily, because we do not yet know which classes from Pyparsing we will need. Once the parser is written, we will replace * with the names we actually used.
With pyparsing, you first write parsers for individual keywords, characters and short phrases, and then assemble the parser for the entire text from these parts.
Let's start with the module name. Formal grammar: in the general case, a module name is a word consisting of letters and underscores. In pyparsing:
module_name = Word(alphas + '_')
Word is a word and alphas are letters, so Word(alphas + '_') is a word made up of letters and underscores, and module_name stands for the name of a module. Reading it all together: the name of the module is a word consisting of letters and underscores. As you can see, Pyparsing notation is very close to natural language.
The full name of a module is the name of a module, then a dot, then the name of another module, then another dot, then the name of a third module, and so on, until we reach the desired module in the chain. The full name may also consist of a single module name with no dots at all. In pyparsing:
full_module_name = module_name + ZeroOrMore('.' + module_name)
ZeroOrMore literally means "zero or more": the contents of the parentheses may be repeated several times or be absent altogether. Reading the second line of the parser as a whole: the full name of the module is a module name followed by a dot and a module name zero or more times.
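A minimal sketch of how this expression behaves on chains of different length. The sample strings and the name chains are chosen here for illustration only.

```python
from pyparsing import Word, alphas, ZeroOrMore

module_name = Word(alphas + '_')
full_module_name = module_name + ZeroOrMore('.' + module_name)

# Sample inputs (illustrative): one, two and three names in the chain.
chains = [full_module_name.parseString(text).asList()
          for text in ('os', 'matplotlib.pyplot', 'xml.etree.ElementTree')]

for chain in chains:
    print(chain)
```

A single name parses without any dots, which is the "zero" case of ZeroOrMore.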
After the full name of the module comes the optional part 'as plt'. It consists of the keyword 'as' followed by the name that we ourselves give to the imported module. In pyparsing:
import_as = Optional('as' + module_name)
Optional literally means "optional": the contents of the parentheses may or may not be present. Altogether we get: an optional expression consisting of the word 'as' and the name of the module.
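A small sketch of Optional in action. The expression tail and the sample strings are hypothetical, introduced here only to show both cases side by side.

```python
from pyparsing import Word, alphas, Optional

module_name = Word(alphas + '_')
import_as = Optional('as' + module_name)

# "tail" is a hypothetical helper: one module name plus the optional alias.
tail = module_name + import_as

with_alias = tail.parseString('pyplot as plt').asList()
without_alias = tail.parseString('pyplot').asList()

print(with_alias)
print(without_alias)
```

In the second call the optional part simply contributes nothing to the result.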
A complete import statement consists of the import keyword, followed by the full name of the module, then the optional 'as plt' construct. In pyparsing:
parse_module = 'import' + full_module_name + import_as
As a result, we have our first parser:
module_name = Word(alphas + '_')
full_module_name = module_name + ZeroOrMore('.' + module_name)
import_as = Optional('as' + module_name)
parse_module = 'import' + full_module_name + import_as
Now you need to parse the string s:
parse_module.parseString(s)
We will get:
(['import', 'matplotlib', '.', 'pyplot', 'as', 'plt'], {})
The output can be improved by converting the result to a list:
parse_module.parseString(s).asList()
We get:
['import', 'matplotlib', '.', 'pyplot', 'as', 'plt']
Now let's improve the parser. First of all, we do not want to see the word import or the dots between the module names in the parser output. Suppress() is used to suppress output. With this in mind, our parser looks like this:
module_name = Word(alphas + '_')
full_module_name = module_name + ZeroOrMore(Suppress('.') + module_name)
import_as = Optional(Suppress('as') + module_name)
parse_module = Suppress('import') + full_module_name + import_as
Running
parse_module.parseString(s).asList()
we get:
['matplotlib', 'pyplot', 'plt']
Now let's make the parser return a dictionary of the form
{'import': [module1, module2, ...], 'as': module}
right away. Before doing this, we need separate access to the list of imported modules (full_module_name) and to our own name for the module (import_as). For this, pyparsing allows you to assign names to parsing results. Let's give the list of imported modules the name 'modules', and our own name for the module the name 'import_as':
full_module_name = (module_name + ZeroOrMore(Suppress('.') + module_name))('modules')
import_as = (Optional(Suppress('as') + module_name))('import_as')
parse_module = Suppress('import') + full_module_name + import_as
As you can see from the first two lines, to give a parsing result a name you put the parser expression in parentheses and follow it with the name of the result, also in parentheses. The third line rebuilds parse_module, because the earlier parse_module still refers to the old, unnamed expressions. Let's see what has changed. To do this, execute the code:
res = parse_module.parseString(s)
print(res.modules.asList())
print(res.import_as.asList())
We get:
['matplotlib', 'pyplot']
['plt']
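The named results can also be inspected all at once. The sketch below assumes the asDict() method of the parse results object; the variable named is introduced here for illustration.

```python
from pyparsing import Word, alphas, ZeroOrMore, Suppress, Optional

module_name = Word(alphas + '_')
full_module_name = (module_name + ZeroOrMore(Suppress('.') + module_name))('modules')
import_as = (Optional(Suppress('as') + module_name))('import_as')
parse_module = Suppress('import') + full_module_name + import_as

res = parse_module.parseString('import matplotlib.pyplot as plt')

# asDict() collects every named result; the keys are the names assigned above.
named = res.asDict()
print(named)
```

This is only a way to look at the names; it does not yet produce the dictionary we are after.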
Now we can separately extract the chain of imported modules and our own name for the module. It remains to make the parser return a dictionary. For this, a so-called parse action is used, that is, an action performed in the process of parsing:
parse_module = (Suppress('import') + full_module_name + import_as).setParseAction(lambda t: {'import': t.modules.asList(), 'as': t.import_as.asList()[0]})
lambda is an anonymous function in Python and t is its argument. After the colon comes a Python dictionary expression into which we substitute the data we need. Calling asList() returns a list. When the as part is present, it is always followed by exactly one name, so the list t.import_as.asList() contains a single value. We therefore take the only element of that list (index zero) and write asList()[0].
Let's check the parser. Running
parse_module.parseString(s).asList()
we get:
[{ 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' }]
We are almost there. Since the resulting list has a single element, add [0] to the end of the line that parses the text:
parse_module.parseString(s).asList()[0]
As a result:
{ 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' }
We got what we wanted.
Now that the goal is reached, go back to 'from pyparsing import *' and replace the asterisk with the names we actually used:
from pyparsing import Word, alphas, ZeroOrMore, Suppress, Optional
As a result, our code has the following form:
from pyparsing import Word, alphas, ZeroOrMore, Suppress, Optional
module_name = Word(alphas + "_")
full_module_name = (module_name + ZeroOrMore(Suppress('.') + module_name))('modules')
import_as = (Optional(Suppress('as') + module_name))('import_as')
parse_module = (Suppress('import') + full_module_name + import_as).setParseAction(lambda t: {'import': t.modules.asList(), 'as': t.import_as.asList()[0]})
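The lambda above assumes that the as part is present. For a bare statement such as 'import os' the named result import_as appears to be empty, so asList() cannot be called on it. Below is a minimal sketch of one way to handle both cases. The helper build_dict and the use of None for a missing alias are assumptions made for illustration, not part of the parser above.

```python
from pyparsing import Word, alphas, ZeroOrMore, Suppress, Optional

module_name = Word(alphas + '_')
full_module_name = (module_name + ZeroOrMore(Suppress('.') + module_name))('modules')
import_as = (Optional(Suppress('as') + module_name))('import_as')


def build_dict(t):
    # Hypothetical helper: get() returns None when no 'as' part was matched.
    alias = t.get('import_as')
    return {'import': t['modules'].asList(), 'as': alias[0] if alias else None}


parse_module = (Suppress('import') + full_module_name + import_as).setParseAction(build_dict)

with_alias = parse_module.parseString('import matplotlib.pyplot as plt').asList()[0]
without_alias = parse_module.parseString('import os').asList()[0]

print(with_alias)
print(without_alias)
```

A named function is used instead of a lambda only to leave room for the comment; the parse action mechanism is the same.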
We have examined a very simple example and only a small part of what Pyparsing can do. Left out of this article are recursive expressions, table processing, text search with optimization that dramatically speeds up the search itself, and much more.
In conclusion, a few words about myself. I am a graduate student and teaching assistant at Bauman MSTU (Department MT-1, "Metal-Cutting Machines"). I am fond of Python, Linux, HTML, CSS and JS. My hobby is the automation of engineering work and engineering calculations. I think I can be useful to Habr by sharing what I know about Pyparsing, Sage and some aspects of automating engineering calculations. I also know the SageMathCloud environment, which is a powerful alternative to Wolfram Alpha. SageMathCloud is geared towards doing Python calculations in the cloud. It gives you access to a console (Ubuntu under the hood), Sage, IPython and LaTeX, and it supports collaboration. In addition to Python code, SageMathCloud supports HTML, CSS, JS, CoffeeScript, Go, Fortran, Scilab and much more. The environment is currently free (a fairly stable beta) and is expected to move to a freemium model later. At the moment this environment is not covered on Habr, and I would like to fill this gap.
Thanks to Daria Frolova and Nikita Konovalov for their help in editing the article.