Practical use of Racc, an LALR(1) parser generator for Ruby
As part of creating a framework for some Enterprise class system, I had the task of creating a utility for automated code generation using the UML model. Nothing more suitable for a quick and effective solution to the problem, except using Ruby, and the built-in ERB template engine, turned up by the hand.
The project file of the UML modeling environment was an SQLite3 database; however, the environment stored some of the information in it as objects serialized into BLOB fields. The serialization format was textual, but not compatible with any well-known one: not XML, not YAML, and only remotely resembling JSON. No existing parser could be used.
In simple cases, when you need not the whole object but only a couple of scalar fields of a specific instance, you can of course just dig out the values with regular expressions. Otherwise, there is a universal solution that lets you quickly create your own parser for such structures and deserialize them into Ruby objects.
This is what the original serialized objects look like.
The task is to turn this into a nested Ruby structure built from Array and Hash objects.
data.txt
7ghoJdyGAqACCgeT:"User":Class {
FromEndRelationships=(
<vdHoJdyGAqACCgfe>,
<9bToJdyGAqACCgfF>
);
_masterViewId="7ghoJdyGAqACCgeS";
pmLastModified="1355667704781";
pmAuthor="author";
Child=(
{UwZoJdyGAqACCgei:"name":Attribute {
visibility=71;
pmLastModified="1355667655234";
pmAuthor="author";
type=<_n2oJdyGAqACCgXh>;
pmCreateDateTime="1355667628234";
_modelViews=NULL;
_modelEditable=T;
}},
{9lZoJdyGAqACCgel:"created":Attribute {
visibility=71;
type_string="date";
pmLastModified="1355667655234";
pmAuthor="author";
pmCreateDateTime="1355667630703";
_modelViews=NULL;
_modelEditable=T;
}},
{nLFoJdyGAqACCgeo:"active":Attribute {
visibility=71;
pmLastModified="1355667655234";
pmAuthor="author";
type=<_n2oJdyGAqACCgXY>;
pmCreateDateTime="1355667639609";
_modelViews=NULL;
_modelEditable=T;
}}
);
pmCreateDateTime="1355667607671";
_modelViews=(
{4QhoJdyGAqACCgeU:"View":ModelView {
container=<hguoJdyGAqACCgeL>;
view="7ghoJdyGAqACCgeS";
}}
);
_modelEditable=T;
}
Analyzing the original format, we can distinguish the following elements, which define its grammar:
- An object. It begins with a header in the format Id:"Name":Type, followed by a set of attributes in curly braces.
- Attributes of an object. Written as key=value, terminated by a semicolon.
- Attribute values. A value can be a string, a number, a reference, another object, or a collection of any of the listed values.
- Values are written in a type-specific notation: a string is enclosed in quotation marks, a reference in angle brackets, an object in curly braces, and a collection of values in parentheses, with elements separated by commas.
Such a grammar is context-free: it can easily be described in Backus-Naur form and analyzed with a bottom-up shift/reduce algorithm. One of the most common algorithms in this category is LALR(1); it is used in such well-known "compiler compilers" as Yacc and GNU Bison. There is also an implementation for the Ruby platform, Racc, and we will use it to create our parser.
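Written out as an informal BNF-style sketch (a simplification of the elements listed above; terminal spellings are schematic here), the grammar might look like this:

```
object     ::= OBJECTID '{' attributes '}'
attributes ::= attribute | attributes attribute
attribute  ::= IDENTIFIER '=' value ';'
value      ::= STRING | NUMBER | REFERENCE | BOOLEAN | NULL
             | '{' object '}'
             | '(' values ')'
values     ::= value | values ',' value
```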
The shift/reduce algorithm
To get our bearings in Racc and create the elements of our future parser, it is enough to understand, in simplified form, the principle of bottom-up shift/reduce parsing.
Suppose we have an input stream of symbols:

WATER + ALCOHOL + JUICE

We also have grammar rules defined:

VODKA is WATER + ALCOHOL
COCKTAIL is VODKA + JUICE

The incoming stream is divided into two parts by a marker: the already analyzed part on the left and the not-yet-analyzed part on the right. The marker is represented by the '|' symbol. The algorithm reads one symbol at a time from the stream; this is the shift operation (Shift), which moves the marker one position to the right. After each shift, the algorithm tries to perform a reduce operation on the part to the left of the marker, in accordance with the grammar rules. This continues until the symbols run out and all reductions have been performed:

s: WATER | + ALCOHOL + JUICE
r: WATER | + ALCOHOL + JUICE
s: WATER + | ALCOHOL + JUICE
r: WATER + | ALCOHOL + JUICE
s: WATER + ALCOHOL | + JUICE
r: VODKA | + JUICE
s: VODKA + | JUICE
r: VODKA + | JUICE
s: VODKA + JUICE |
r: COCKTAIL |

Install Racc
The runtime environment for Racc-based parsers is already included in Ruby: starting from version 1.8.x, any parser created with Racc works without extra steps. However, to generate your own parsers you must install the racc package, which contains the parser generator. This is done in the standard way via gem:
# gem install racc

Creating a parser on Racc
The source for the Racc parser generator is a file (usually with the .y extension) that contains the definition of the Ruby parser class plus additional Racc directives. The following elements must be present in the parser definition:
- Grammar terminal character definitions - tokens
- Grammar and convolution rule definitions
- Lexical analyzer
Tokens
So, the first thing to do is to identify the grammar's terminal symbols (the elementary particles from which more complex constructions are composed) and give them names. We get something like this:
- T_OBJECTID - object identifier
- T_IDENTIFIER - attribute identifier
- T_STRING - text string
- T_NUMBER - number
- T_BOOLEAN - Boolean value
- T_NULL - empty value
- T_REFERENCE - reference to another object
- T_LBR, T_RBR, T_LPAR, T_RPAR, T_EQ, T_COMMA, T_SEMICOLON - punctuation marks: different brackets, equal sign, comma, semicolon
In the parser class definition, valid tokens are declared using the token keyword.
Lexical analyzer
The next step is to create a lexical analyzer. Its job is to scan the incoming stream, identify tokens in it, and thereby transform the stream into a sequence of tokens. When performing a shift operation, Racc obtains the next token by calling the next_token method of the parser class, which should return a two-element structure containing the name and the value of the token:
[:TOKEN_NAME, 'Token value']

The end of the stream is signaled by the value:
[false, false]

It is convenient to use the StringScanner class to build the lexical analyzer. It is sometimes important to choose the scanning order of the patterns carefully, as some tokens may overlap others. In this example, the lexical analyzer processes the entire incoming stream at once at the start of parsing, in the tokenize method, and stores the resulting tokens in an array from which next_token fetches the next token on each call.
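As a minimal sketch (not the article's full lexer, which appears below), the following shows the StringScanner pattern-ordering point on a single attribute line: the digits pattern must be tried before the general word pattern, otherwise '71' would be reported as an identifier.

```ruby
require 'strscan'

# Tokenize one attribute line; pattern order matters.
s = StringScanner.new('visibility=71;')
tokens = []
until s.eos?
  case
  when s.skip(/\s+/) then next                        # skip whitespace
  when s.scan(/\d+/) then tokens << [:T_NUMBER, s[0]] # digits before \w+
  when s.scan(/\w+/) then tokens << [:T_IDENTIFIER, s[0]]
  when s.scan(/=/)   then tokens << [:T_EQ, nil]
  when s.scan(/;/)   then tokens << [:T_SEMICOLON, nil]
  else s.getch                                        # drop unknown characters
  end
end
tokens << [false, false]                              # end-of-stream marker for Racc
```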
Grammar rules
The next step is to describe the grammar rules. In Racc this is done with the rule directive, after which the rules are given in Backus-Naur form: nonterminal symbols are expressed through terminals and other nonterminals. For example, the following fragment defines the nonterminal attribute, an attribute of an object, which, in accordance with our grammar, consists of an identifier, an equals sign, an arbitrary value expressed by another nonterminal, and a terminating semicolon:
attribute: T_IDENTIFIER T_EQ value T_SEMICOLON { result = { val[0] => val[2] } };

The curly braces after the rule contain the Ruby action executed when the given sequence of symbols is reduced to the declared nonterminal, in this case attribute. The variables val and result are predefined by Racc. The val array contains the values of the symbols being reduced: here val[0] holds the value of T_IDENTIFIER and val[2] the value of the value nonterminal. Whatever is assigned to result becomes the value of the resulting nonterminal.
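To make the val/result mechanics concrete, here is a plain-Ruby simulation (outside Racc, with hand-filled values) of two reductions from the grammar below:

```ruby
# For the rule attribute: T_IDENTIFIER T_EQ value T_SEMICOLON,
# val holds the semantic values of the four matched symbols:
val = ['pmAuthor', nil, 'author', nil]   # T_EQ and T_SEMICOLON carry no value
attribute = { val[0] => val[2] }         # what the action assigns to result

# For the rule attributes: attributes attribute, the accumulated hash
# (val[0]) is merged with the newly reduced attribute (val[1]):
val = [{ 'visibility' => 71 }, attribute]
attributes = val[0].merge(val[1])
```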
To start the parser, call the do_parse method. It is defined in the Racc::Parser class, from which our parser class is inherited.
Below is the full source code for the parser definition.
parser.y
class Parser
token T_OBJECTID T_STRING T_REFERENCE T_IDENTIFIER T_NUMBER T_BOOLEAN T_NULL
token T_LBR T_RBR T_LPAR T_RPAR T_EQ T_COMMA T_SEMICOLON
start input
rule
input: object { @result = val[0] };
object: T_OBJECTID T_LBR attributes T_RBR {
oid = val[0].split(':');
result = {
:id => dequote(oid[0]),
:name => convertNULL(dequote(oid[1])),
:type => dequote(oid[2])
}.merge(val[2])
};
attributes: attribute { result = val[0] }
| attributes attribute { result = val[0].merge(val[1]) }
;
attribute: T_IDENTIFIER T_EQ value T_SEMICOLON { result = { val[0] => val[2] } };
values: value { result = [val[0]] }
| values T_COMMA value { val[0] << val[2] }
;
value: T_STRING { result = dequote(val[0]) }
| T_REFERENCE { result = dequote(val[0]) }
| T_NUMBER { result = val[0].to_i }
| T_BOOLEAN { result = val[0] == 'T' }
| T_NULL { result = nil }
| T_LBR object T_RBR { result = val[1] }
| T_LPAR values T_RPAR { result = val[1] }
;
end
---- header ----
require 'strscan'
---- inner ----
def tokenize(text)
tokens = []
s = StringScanner.new(text)
until s.eos?
case
when s.skip(/\s+/)
next
when s.scan(/\A[\w\.]+:\"*\w*\"*:\w+/)
tokens << [:T_OBJECTID, s[0]]
next
when s.scan(/\A\{/)
tokens << [:T_LBR, nil]
next
when s.scan(/\A\}/)
tokens << [:T_RBR, nil]
next
when s.scan(/\A\(/)
tokens << [:T_LPAR, nil]
next
when s.scan(/\A\)/)
tokens << [:T_RPAR, nil]
next
when s.scan(/\A\=/)
tokens << [:T_EQ, nil]
next
when s.scan(/\A\,/)
tokens << [:T_COMMA, nil]
next
when s.scan(/\A\;/)
tokens << [:T_SEMICOLON, nil]
next
when s.scan(/\w+/)
if (s[0].match(/\A[0-9]+\Z/))
tokens << [:T_NUMBER, s[0]]
elsif (s[0].match(/\A[TF]{1}\Z/))
tokens << [:T_BOOLEAN, s[0]]
elsif (s[0].match(/\ANULL\Z/))
tokens << [:T_NULL, nil]
else
tokens << [:T_IDENTIFIER, s[0]]
end
next
when s.scan(/(["<]).*?([">])/)
tokens << [:T_REFERENCE, s[0]]
next
else
s.getch
next
end
end
tokens << [false, false]
return tokens
end
def parse(text)
@tokens = tokenize(text)
do_parse
return @result
end
def next_token
@tokens.shift
end
def dequote(text)
text.gsub(/\A["<]|[">]\Z/, '').strip
end
def convertNULL(text)
text.upcase == "NULL" ? nil : text
end
The .y file is not yet a parser. The parser must be generated with the racc utility, and this has to be redone every time the parser definition in the .y file changes:
# racc parser.y -o parser.rb

The resulting file with the parser class can be required and used in the usual way:
main.rb
require 'pp'
require './parser.rb'
parser = Parser.new
obj = parser.parse(File.read("data.txt"))
puts obj.pretty_inspect
This is the result produced by the parser, the complete nested Ruby structure:
output
{
:id => "7ghoJdyGAqACCgeT",
:name => "User",
:type => "Class",
"FromEndRelationships" => ["vdHoJdyGAqACCgfe", "9bToJdyGAqACCgfF"],
"_masterViewId" => "7ghoJdyGAqACCgeS",
"pmLastModified" => "1355667704781",
"pmAuthor" => "author",
"Child" => [
{
:id => "UwZoJdyGAqACCgei",
:name => "name",
:type => "Attribute",
"visibility" => 71,
"pmLastModified" => "1355667655234",
"pmAuthor" => "author",
"type" => "_n2oJdyGAqACCgXh",
"pmCreateDateTime" => "1355667628234",
"_modelViews" => nil,
"_modelEditable" => true
},
{
:id => "9lZoJdyGAqACCgel",
:name => "created",
:type => "Attribute",
"visibility" => 71,
"type_string" => "date",
"pmLastModified" => "1355667655234",
"pmAuthor" => "author",
"pmCreateDateTime" => "1355667630703",
"_modelViews" => nil,
"_modelEditable" => true
},
{
:id => "nLFoJdyGAqACCgeo",
:name => "active",
:type => "Attribute",
"visibility" => 71,
"pmLastModified" => "1355667655234",
"pmAuthor" => "author",
"type" => "_n2oJdyGAqACCgXY",
"pmCreateDateTime" => "1355667639609",
"_modelViews" => nil,
"_modelEditable" => true
}
],
"pmCreateDateTime" => "1355667607671",
"_modelViews" => [
{
:id => "4QhoJdyGAqACCgeU",
:name => "View",
:type => "ModelView",
"container" => "hguoJdyGAqACCgeL",
"view" => "7ghoJdyGAqACCgeS"
}
],
"_modelEditable" => true
}
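Once deserialized, the structure can be traversed with ordinary Hash and Array methods, which is all the code generator needs. A hypothetical example on a structure shaped like the output above (the hash is built literally here so the snippet is self-contained):

```ruby
# A cut-down copy of the parser's output shape shown above.
obj = {
  :id   => '7ghoJdyGAqACCgeT',
  :name => 'User',
  :type => 'Class',
  'Child' => [
    { :name => 'name',    'visibility' => 71 },
    { :name => 'created', 'visibility' => 71 },
    { :name => 'active',  'visibility' => 71 }
  ]
}

# Collect the attribute names of the class, e.g. to feed an ERB template.
child_names = obj['Child'].map { |attr| attr[:name] }
```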