Lightweight parser designer with interactive mode

Occasionally confronted with small tasks to develop simple text analyzers, I decided to automate this task, and the academic interest did not give rest. Initially, I looked towards Racc (one of Yacc's interpretations ), but it seemed to me a rather difficult solution for my small tasks, and then I decided to develop my simple analyzer. Although, of course, if you are developing a compiler or something similar and even on an industrial scale, then you should definitely look towards Racc .

But if you want to figure out for yourself what a parser is and how to write it yourself yourself quickly, without reading a bunch of articles about lexical analyzers like a dragon’s book, then go ahead under the cut (although the book is very good).

Input stream analysis


If we consider the problem of analyzing the input stream in a very abstract way, and I think we should start from this, then it all comes down to the implementation of a finite state machine, i.e. state machines, where the condition for the transition to a certain state is the presence in the input stream of the necessary data (hereinafter referred to as templates). In turn, the transition to the state can be performed only if the characters of the input stream are unambiguous and completely coincide with a specific template, why it is unambiguous, because from one state, it is possible to switch to several states at once, and while the input symbols of the stream coincide with several state templates at the same time, the analyzer is in a state of uncertainty. Based on the foregoing, the analyzer itself may be in the following states:

  • state of certainty.
  • state of uncertainty.
  • state change.

Of course, erroneous conditions are also possible, but since we consider the generalized analyzer model and don’t know on which character sets the error should be generated, then this task rests with the concrete implementation.

In the state of certainty, we know exactly which state handler we need to call, but in the state of uncertainty we do not know what the next state will be, i.e. which template will match or in the end there will never be a unique match, then in this case we should buffer the input stream and then transfer the characters from the accumulated buffer to an already defined state, i.e. in case of uncertainty, we buffer the input stream and already in the state transition mode, we must transfer the buffer data to a new state or return to the current one if the transition never happened.

Now, it is possible to decompose, more precisely, detail the process of analyzing the input stream into:
classification of the input stream (coincides with one, with many and does not coincide with any template) and processing of the classification result (transition, buffering, etc.).

Implementation


Although I am a developer of embedded systems by my main occupation and my main programming language is ANSI C, I decided to implement this model in Ruby, as you think less about the technical details of the implementation on it and planned to implement this model as a Ruby Gem. I really love everything universal (within a certain framework), short and concise. In the future, I plan to implement this model in emdedded for the input traffic analyzer, but in a more simplified form and possibly on HDL.

So, let’s begin, I repeat once again, I don’t know Ruby very deeply, therefore I made a bet on the accumulated experience and a clear algorithmization of the process in my head, therefore I thought that my Ruby knowledge was enough for this, as my music teacher once said: “Learn to improvise and make music on three to five notes, and then you’ll thresh cool passages, otherwise it’s not possible to listen to your continuous arpeggio.” Sometimes remembering him, now I think that he was right, although then I wanted cool solos, okay, back to the topic.

I believe the development process should go in stages, with the stages being “tangible”, i.e. you must understand what needs to be done to implement this stage and in what volumes (laboriousness), I won’t go deeply about this now, not this topic of our conversation, but for the implementation of the parser, its building and testing should go along the gradual build-up of code with each state. And here I could not do without interactive mode, if someone uses my gem, I think he should also initially debug his model’s behavior in interactive mode, and only then write parser state handlers.

Example


For clarity, I’ll give an example of implementing a simple source code file parser for automatically generating documentation remotely resembling doxygen. Initially, you need to install my gem:

$ gem install iparser

As you already understood, the gem is called iparser and after installing it, create a file called 'parser_example.rb', in which we will write the code for our parser based on iparser . First of all, we’ll connect the library and create the object of the parser machine, which implements the main logic of the analyzer, i.e. instance of the parser itself:

require 'iparser'
parser = Iparser::Machine.new

The next step is to describe the state of the parser. We will have a very simple parser and it will have only three states:

  • initial (idle).
  • single-line comments (comment-line).
  • multi-line comments (comment-block).

Now we can describe each state separately (below I will give the script code as a whole). I will not describe the state of 'idle' in detail since in our case it’s empty, and we’ll take a closer look at the 'comment-line' state. To create a state instance, use the following code:

ps_idle = Iparser::State.new('idle')
ps_cline = Iparser::State.new('comment-line')

In the parameters of the state constructor, a line is indicated with the name (identifier) ​​of the state, preferably understandable to you, because used for further debugging of the model. Each parser state object has an entry field , this is a pattern of coincidence with which a transition to this state is made, i.e. condition entry condition. Comparison with the pattern is done character by character, as well as input stream processing. Entering the state will be carried out if there are consecutive characters '\\\' in the input stream. There is also a leave field in order to indicate a template for exiting the state (return to the previous one), in our case these are the characters of the end of the string '\ n \ r', although it is enough to specify the first of them, but for clarity we will indicate both. Now describe the 'comment-line':

ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.leave << /[\n\r]/

Please note that patterns are set using regular expressions, it is possible without them, but this is later.

Next, we will create a state for processing multi-line comments, we will get into it when the characters '/ **' meet, and leave it if there is '* /'. Also write:

ps_cblock = Iparser::State.new('comment-block')
ps_cblock.entry << /\//
ps_cblock.entry << /\*/
ps_cblock.entry << /\*/
ps_cblock.leave << /\*/
ps_cblock.leave << /\//
ps_cblock.ignore << '*'

We also indicated characters that should be ignored while in this state. We have an asterisk symbol, because I like to write multiline comments of the form:

/**
 * ...
 * ...
 */

Now we should connect our three states into a single chain, from 'idle' we can get into 'comment-line' and 'comment-block', and from them only back to 'idle'. Linking is done by specifying state indices in the branches field of each of the states. The index will be determined by the order of adding state objects to the parser instance; to add objects to the parser, the addstate parser method is used .

ps_idle.branches << 1
ps_idle.branches << 2
parser.addstate ps_idle
parser.addstate ps_cline
parser.addstate ps_cblock

And finally, we need to check whether we created the state chain correctly, for this we will start the interactive mode (the prestart method will set all the initial parameters):

parser.prestart
parser.interactive_parser

For clarity, I will give the script code as a whole, of course I rebuilt it a bit, but it contains only the code I wrote above:

require 'iparser'
#
# Create parser-machine object.
#
parser = Iparser::Machine.new
#
# Create startup state for this parser-machine.
#
ps_idle = Iparser::State.new('idle')
#
# Add branch indexes to 'comment-line' and 'comment-block' state.
#
ps_idle.branches << 1
ps_idle.branches << 2
#
# Create single line comment state for this parser-machine.
#
ps_cline = Iparser::State.new('comment-line')
ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.leave << /[\n\r]/
#
# Create multiline comment state for this parser-machine.
#
ps_cblock = Iparser::State.new('comment-block')
ps_cblock.entry << /\//
ps_cblock.entry << /\*/
ps_cblock.entry << /\*/
ps_cblock.leave << /\*/
ps_cblock.leave << /\//
ps_cblock.ignore << '*'
#
# Add all states to parser-machine.
#
parser.addstate ps_idle
parser.addstate ps_cline
parser.addstate ps_cblock
#
# Call parser startup method.
#
parser.prestart
#
# Call interactive mode for check state-machine.
#
parser.interactive_parser

Not so big code, now run the script:

$ ruby parser_example.rb

To exit interactive mode, you must enter an empty line (just press enter immediately).

Enter '\\\' and see that on the last character the parser goes into the state 'comment-line' ( branch to comment-line ), now we enter '\ n' or '\ r' and we see that the parser goes back to the state ' idle '( back ). The character '\' is used to enter esc characters, so to enter the character '\' you need to type it twice '\\', as well as when specifying strings in C and other languages. In interactive mode, you can indulge as much as you like until you make sure that the chain of all states is connected correctly.

The final step, given that we checked and debugged the transition logic of our automaton, you can add handlers for our states (and only now, as I wrote above about the development process), they will be very simple. Each state can have three handlers:

  • init: state initializer (state constructor).
  • handler: state handler.
  • fini: state finalizer (state destructor).

The state constructor will be called only once upon entering the state and will receive as an argument an array of buffered characters that the parser has accumulated while it was in the uncertainty mode. The handler will be called constantly and will receive an input stream symbol as an argument until it leaves the state, and the destructor is called only once upon exiting the given state.

The handler method can also control the operation of the parser, although it may be that I implemented it in vain, as long as I have it, so I need to describe it. If the handler returns a Fixnum data type whose value is in the range (> = 0), then this will be regarded as an index for transition to some state, if the index goes beyond the boundaries of the state array, the parser will throw an exception. If the handler returnsnil , the parser will hold this state, i.e. no reaction and if any other value, then this is regarded as an error and the parser will return false , which signals a parsing error, because in all other cases, it returns true . Below I will give the modified code completely, as Like and so I tried to describe everything in detail.

require 'iparser'
#
# Simple check startup arguments.
#
if( ARGV.size != 1 || !File.exist?(ARGV[0]) )
  puts
  puts "ERROR: unable to open file #{ARGV[0]}"
  puts
  exit
end
#
# Create output file.
#
$fout = File.new( 'index.html', 'w' )
#
# Create initializer method for parser-states.
#
def doc_init ( str )
  $fout.print "

" end # # Create handler method for parser-states. # def doc_handler ( c ) $fout.print c end # # Create finalizer method for parser-states. # def doc_fini ( str ) $fout.puts "

" end # # Create parser-machine object. # parser = Iparser::Machine.new # # Create startup state for this parser-machine. # ps_idle = Iparser::State.new('idle') # # Add branch indexes to 'comment-line' and 'comment-block' state. # ps_idle.branches << 1 ps_idle.branches << 2 # # Create single line comment state for this parser-machine. # ps_cline = Iparser::State.new('comment-line') ps_cline.entry << /\// ps_cline.entry << /\// ps_cline.entry << /\// ps_cline.leave << /[\n\r]/ # # Create multiline comment state for this parser-machine. # ps_cblock = Iparser::State.new('comment-block') ps_cblock.entry << /\// ps_cblock.entry << /\*/ ps_cblock.entry << /\*/ ps_cblock.leave << /\*/ ps_cblock.leave << /\// ps_cblock.ignore << '*' # # Add handlers for states. # ps_cline.init( method(:doc_init) ) ps_cline.handler( method(:doc_handler) ) ps_cline.fini( method(:doc_fini) ) ps_cblock.init( method(:doc_init) ) ps_cblock.handler( method(:doc_handler) ) ps_cblock.fini( method(:doc_fini) ) # # Add all states to parser-machine. # parser.addstate ps_idle parser.addstate ps_cline parser.addstate ps_cblock # # Call parser startup method. # parser.prestart # # Call interactive mode for check state-machine. # $fout.puts "" $fout.puts "" File.open( ARGV[0], 'r' ).each do |line| line.each_char do |c| parser.parse(c) end end $fout.puts "" $fout.puts "" $fout.close

Yes, please note that our handler (doc_handler) does not return nil , because we do not analyze the result of the parser ( parser.parse method ). For the final check, create a test file named 'test.c' and fill it with the following contents:

#include 
///Test function - 1.
void test1 ( void )
{
}
/**
 * Test function - 2.
 */
void test2 ( void )
{
}

Run our script for the last time and see the result of our parser.

$ ruby parser_example.rb test.c

This is the file 'index.html', that's all, thank you all for your attention!

I hope that it will help someone and reduce the time to study the basic principles of analyzers, or maybe someone will want to use my gem for personal purposes without finding a simpler option, if you don’t like it, don’t kick much, although no, criticize because you don’t become a good creator without good criticism. If anyone is interested, here is a link to gem iparser .

Also popular now: