Practical use of Racc, an LALR(1) parser generator for Ruby
As part of creating a framework for some Enterprise class system, I had the task of creating a utility for automated code generation using the UML model. Nothing more suitable for a quick and effective solution to the problem, except using Ruby, and the built-in ERB template engine, turned up by the hand.
The project file of the UML modeling environment was an SQLite3 database; however, the environment stored some of the information in it as objects serialized into BLOB fields. The serialization format was textual, but not compatible with any well-known one: not XML, not YAML, and only remotely resembling JSON. No existing parser could be used.
In simple cases, when you need not the whole object but only a couple of scalar fields of a specific instance, you can of course just dig out the values with regular expressions. Otherwise, there is a universal solution that lets you quickly create your own parser for such structures and deserialize them into Ruby objects.
This is what the original serialized objects look like.
The task is to turn this into a nested Ruby structure built from Array and Hash objects.
data.txt
7ghoJdyGAqACCgeT:"User":Class {
FromEndRelationships=(
<vdHoJdyGAqACCgfe>,
<9bToJdyGAqACCgfF>
);
_masterViewId="7ghoJdyGAqACCgeS";
pmLastModified="1355667704781";
pmAuthor="author";
Child=(
{UwZoJdyGAqACCgei:"name":Attribute {
visibility=71;
pmLastModified="1355667655234";
pmAuthor="author";
type=<_n2oJdyGAqACCgXh>;
pmCreateDateTime="1355667628234";
_modelViews=NULL;
_modelEditable=T;
}},
{9lZoJdyGAqACCgel:"created":Attribute {
visibility=71;
type_string="date";
pmLastModified="1355667655234";
pmAuthor="author";
pmCreateDateTime="1355667630703";
_modelViews=NULL;
_modelEditable=T;
}},
{nLFoJdyGAqACCgeo:"active":Attribute {
visibility=71;
pmLastModified="1355667655234";
pmAuthor="author";
type=<_n2oJdyGAqACCgXY>;
pmCreateDateTime="1355667639609";
_modelViews=NULL;
_modelEditable=T;
}}
);
pmCreateDateTime="1355667607671";
_modelViews=(
{4QhoJdyGAqACCgeU:"View":ModelView {
container=<hguoJdyGAqACCgeL>;
view="7ghoJdyGAqACCgeS";
}}
);
_modelEditable=T;
}
Analyzing the original format, we can distinguish the following elements, which define its grammar:
- An object. It begins with a header in the format Id:"Name":Type, followed by a set of attributes in curly braces.
- Attributes of an object. Written as key=value, terminated by a semicolon.
- Attribute values. A value can be a string, a number, a reference, another object, or a collection of any of the listed values.
- Values are written in a type-specific notation: a string is enclosed in quotation marks, a reference in angle brackets, an object in curly braces, and a collection of values in parentheses, with elements separated by commas.
Such a grammar is context-free: it can easily be described in Backus-Naur form and analyzed with a bottom-up shift/reduce algorithm. One of the most common algorithms in this category is LALR(1); it is used in such well-known "compiler compilers" as Yacc and GNU Bison. There is also an implementation for the Ruby platform, Racc, and we will use it to create our parser.
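Written out as an informal BNF-style sketch (a simplification of the elements listed above; terminal spellings are schematic here), the grammar might look like this:

```
object     ::= OBJECTID '{' attributes '}'
attributes ::= attribute | attributes attribute
attribute  ::= IDENTIFIER '=' value ';'
value      ::= STRING | NUMBER | REFERENCE | BOOLEAN | NULL
             | '{' object '}'
             | '(' values ')'
values     ::= value | values ',' value
```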
The shift/reduce algorithm
To get our bearings in Racc and create the elements of our future parser, it is enough to understand, in simplified form, the principle of bottom-up shift/reduce parsing.
Suppose we have an input stream of symbols:

WATER + ALCOHOL + JUICE

We also have grammar rules defined:

VODKA is WATER + ALCOHOL
COCKTAIL is VODKA + JUICE

The incoming stream is divided into two parts by a marker: the already analyzed part on the left and the not-yet-analyzed part on the right. The marker is represented by the '|' symbol. The algorithm reads one symbol at a time from the stream; this is the shift operation (Shift), which moves the marker one position to the right. After each shift, the algorithm tries to perform a reduce operation on the part to the left of the marker, in accordance with the grammar rules. This continues until the symbols run out and all reductions have been performed:

s: WATER | + ALCOHOL + JUICE
r: WATER | + ALCOHOL + JUICE
s: WATER + | ALCOHOL + JUICE
r: WATER + | ALCOHOL + JUICE
s: WATER + ALCOHOL | + JUICE
r: VODKA | + JUICE
s: VODKA + | JUICE
r: VODKA + | JUICE
s: VODKA + JUICE |
r: COCKTAIL |

Install Racc
The runtime environment for Racc-based parsers is already included in Ruby: starting from version 1.8.x, any parser created with Racc works without extra steps. However, to generate your own parsers you must install the racc package, which contains the parser generator. This is done in the standard way via gem:
# gem install racc

Creating a parser on Racc
The source for the Racc parser generator is a file (usually with the .y extension) that contains the definition of the Ruby parser class plus additional Racc directives. The following elements must be present in the parser definition:
- Grammar terminal character definitions - tokens
- Grammar and convolution rule definitions
- Lexical analyzer
Tokens
So, the first thing to do is to identify the grammar's terminal symbols (the elementary particles from which more complex constructions are composed) and give them names. We get something like this:
- T_OBJECTID - object identifier
- T_IDENTIFIER - attribute identifier
- T_STRING - text string
- T_NUMBER - number
- T_BOOLEAN - Boolean value
- T_NULL - empty value
- T_REFERENCE - reference to another object
- T_LBR, T_RBR, T_LPAR, T_RPAR, T_EQ, T_COMMA, T_SEMICOLON - punctuation marks: different brackets, equal sign, comma, semicolon
In the parser class definition, valid tokens are declared using the token keyword.
Lexical analyzer
The next step is to create a lexical analyzer. Its job is to scan the incoming stream, identify tokens in it, and thereby transform the stream into a sequence of tokens. When performing a shift operation, Racc obtains the next token by calling the next_token method of the parser class, which should return a two-element structure containing the name and the value of the token:
[:TOKEN_NAME, 'Token value']

The end of the stream is signaled by the value:
[false, false]

It is convenient to use the StringScanner class to build the lexical analyzer. It is sometimes important to choose the scanning order of the patterns carefully, as some tokens may overlap others. In this example, the lexical analyzer processes the entire incoming stream at once at the start of parsing, in the tokenize method, and stores the resulting tokens in an array from which next_token fetches the next token on each call.
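As a minimal sketch (not the article's full lexer, which appears below), the following shows the StringScanner pattern-ordering point on a single attribute line: the digits pattern must be tried before the general word pattern, otherwise '71' would be reported as an identifier.

```ruby
require 'strscan'

# Tokenize one attribute line; pattern order matters.
s = StringScanner.new('visibility=71;')
tokens = []
until s.eos?
  case
  when s.skip(/\s+/) then next                        # skip whitespace
  when s.scan(/\d+/) then tokens << [:T_NUMBER, s[0]] # digits before \w+
  when s.scan(/\w+/) then tokens << [:T_IDENTIFIER, s[0]]
  when s.scan(/=/)   then tokens << [:T_EQ, nil]
  when s.scan(/;/)   then tokens << [:T_SEMICOLON, nil]
  else s.getch                                        # drop unknown characters
  end
end
tokens << [false, false]                              # end-of-stream marker for Racc
```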
Grammar rules
The next step is to describe the grammar rules. In Racc this is done with the rule directive, after which the rules are given in Backus-Naur form: nonterminal symbols are expressed through terminals and other nonterminals. For example, the following fragment defines the nonterminal attribute, an attribute of an object, which, in accordance with our grammar, consists of an identifier, an equals sign, an arbitrary value expressed by another nonterminal, and a terminating semicolon:
attribute: T_IDENTIFIER T_EQ value T_SEMICOLON { result = { val[0] => val[2] } };

The curly braces after the rule contain the Ruby action executed when the given sequence of symbols is reduced to the declared nonterminal, in this case attribute. The variables val and result are predefined by Racc. The val array contains the values of the symbols being reduced: here val[0] holds the value of T_IDENTIFIER and val[2] the value of the value nonterminal. Whatever is assigned to result becomes the value of the resulting nonterminal.
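To make the val/result mechanics concrete, here is a plain-Ruby simulation (outside Racc, with hand-filled values) of two reductions from the grammar below:

```ruby
# For the rule attribute: T_IDENTIFIER T_EQ value T_SEMICOLON,
# val holds the semantic values of the four matched symbols:
val = ['pmAuthor', nil, 'author', nil]   # T_EQ and T_SEMICOLON carry no value
attribute = { val[0] => val[2] }         # what the action assigns to result

# For the rule attributes: attributes attribute, the accumulated hash
# (val[0]) is merged with the newly reduced attribute (val[1]):
val = [{ 'visibility' => 71 }, attribute]
attributes = val[0].merge(val[1])
```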
To start the parser, call the do_parse method. It is defined in the Racc::Parser class, from which our parser class is inherited.
Below is the full source code for the parser definition.
parser.y
class Parser
token T_OBJECTID T_STRING T_REFERENCE T_IDENTIFIER T_NUMBER T_BOOLEAN T_NULL
token T_LBR T_RBR T_LPAR T_RPAR T_EQ T_COMMA T_SEMICOLON
start input
rule
input: object { @result = val[0] };
object: T_OBJECTID T_LBR attributes T_RBR {
oid = val[0].split(':');
result = {
:id => dequote(oid[0]),
:name => convertNULL(dequote(oid[1])),
:type => dequote(oid[2])
}.merge(val[2])
};
attributes: attribute { result = val[0] }
| attributes attribute { result = val[0].merge(val[1]) }
;
attribute: T_IDENTIFIER T_EQ value T_SEMICOLON { result = { val[0] => val[2] } };
values: value { result = [val[0]] }
| values T_COMMA value { val[0] << val[2] }
;
value: T_STRING { result = dequote(val[0]) }
| T_REFERENCE { result = dequote(val[0]) }
| T_NUMBER { result = val[0].to_i }
| T_BOOLEAN { result = val[0] == 'T' }
| T_NULL { result = nil }
| T_LBR object T_RBR { result = val[1] }
| T_LPAR values T_RPAR { result = val[1] }
;
end
---- header ----
require 'strscan'
---- inner ----
def tokenize(text)
tokens = []
s = StringScanner.new(text)
until s.eos?
case
when s.skip(/\s+/)
next
when s.scan(/\A[\w\.]+:\"*\w*\"*:\w+/)
tokens << [:T_OBJECTID, s[0]]
next
when s.scan(/\A\{/)
tokens << [:T_LBR, nil]
next
when s.scan(/\A\}/)
tokens << [:T_RBR, nil]
next
when s.scan(/\A\(/)
tokens << [:T_LPAR, nil]
next
when s.scan(/\A\)/)
tokens << [:T_RPAR, nil]
next
when s.scan(/\A\=/)
tokens << [:T_EQ, nil]
next
when s.scan(/\A\,/)
tokens << [:T_COMMA, nil]
next
when s.scan(/\A\;/)
tokens << [:T_SEMICOLON, nil]
next
when s.scan(/\w+/)
if (s[0].match(/\A[0-9]+\Z/))
tokens << [:T_NUMBER, s[0]]
elsif (s[0].match(/\A[TF]{1}\Z/))
tokens << [:T_BOOLEAN, s[0]]
elsif (s[0].match(/\ANULL\Z/))
tokens << [:T_NULL, nil]
else
tokens << [:T_IDENTIFIER, s[0]]
end
next
when s.scan(/(["<]).*?([">])/)
tokens << [:T_REFERENCE, s[0]]
next
else
s.getch
next
end
end
tokens << [false, false]
return tokens
end
def parse(text)
@tokens = tokenize(text)
do_parse
return @result
end
def next_token
@tokens.shift
end
def dequote(text)
text.gsub(/\A["<]|[">]\Z/, '').strip
end
def convertNULL(text)
text.upcase == "NULL" ? nil : text
end
The .y file is not yet a parser. The parser must be generated with the racc utility, and this has to be redone every time the parser definition in the .y file changes:
# racc parser.y -o parser.rb

The resulting file with the parser class can be required and used in the usual way:
main.rb
require 'pp'
require './parser.rb'
parser = Parser.new
obj = parser.parse(File.read("data.txt"))
puts obj.pretty_inspect
This is the result produced by the parser, the complete nested Ruby structure:
output
{
:id => "7ghoJdyGAqACCgeT",
:name => "User",
:type => "Class",
"FromEndRelationships" => ["vdHoJdyGAqACCgfe", "9bToJdyGAqACCgfF"],
"_masterViewId" => "7ghoJdyGAqACCgeS",
"pmLastModified" => "1355667704781",
"pmAuthor" => "author",
"Child" => [
{
:id => "UwZoJdyGAqACCgei",
:name => "name",
:type => "Attribute",
"visibility" => 71,
"pmLastModified" => "1355667655234",
"pmAuthor" => "author",
"type" => "_n2oJdyGAqACCgXh",
"pmCreateDateTime" => "1355667628234",
"_modelViews" => nil,
"_modelEditable" => true
},
{
:id => "9lZoJdyGAqACCgel",
:name => "created",
:type => "Attribute",
"visibility" => 71,
"type_string" => "date",
"pmLastModified" => "1355667655234",
"pmAuthor" => "author",
"pmCreateDateTime" => "1355667630703",
"_modelViews" => nil,
"_modelEditable" => true
},
{
:id => "nLFoJdyGAqACCgeo",
:name => "active",
:type => "Attribute",
"visibility" => 71,
"pmLastModified" => "1355667655234",
"pmAuthor" => "author",
"type" => "_n2oJdyGAqACCgXY",
"pmCreateDateTime" => "1355667639609",
"_modelViews" => nil,
"_modelEditable" => true
}
],
"pmCreateDateTime" => "1355667607671",
"_modelViews" => [
{
:id => "4QhoJdyGAqACCgeU",
:name => "View",
:type => "ModelView",
"container" => "hguoJdyGAqACCgeL",
"view" => "7ghoJdyGAqACCgeS"
}
],
"_modelEditable" => true
}
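Once deserialized, the structure can be traversed with ordinary Hash and Array methods, which is all the code generator needs. A hypothetical example on a structure shaped like the output above (the hash is built literally here so the snippet is self-contained):

```ruby
# A cut-down copy of the parser's output shape shown above.
obj = {
  :id   => '7ghoJdyGAqACCgeT',
  :name => 'User',
  :type => 'Class',
  'Child' => [
    { :name => 'name',    'visibility' => 71 },
    { :name => 'created', 'visibility' => 71 },
    { :name => 'active',  'visibility' => 71 }
  ]
}

# Collect the attribute names of the class, e.g. to feed an ERB template.
child_names = obj['Child'].map { |attr| attr[:name] }
```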