Creating a state machine for parsing an HTTP request

A deterministic finite state machine can be used to implement a very fast way to parse an input sequence. It takes only one pass in the input sequence, and minimal actions at each step. Unfortunately, this model has limitations - it is not always possible to construct a DFA for an existing Nondeterministic finite state machine (regular expression, grammar). Or even if it is possible to build, an automaton may have too many states.

Nevertheless, I decided to try to create a parser for the HTTP request based on the DCA. The main task is not just to check the correctness of the HTTP request, but to select elements in the input line that correspond to certain values ​​of the HTTP request fields. An automaton must be generated from the BNF rules (scattered across) RFC2616. Everything is implemented in C #, the output machine is also in C #. Although it is clear that when the machine is ready to generate it in any language, in any form is not a problem.

The following chain of actions was implemented to generate DFAs:
1. Analysis of BNF rules
2. Creation of NKA based on them
3. Conversion of NKA into DKA
4. Minimization of DKA
5. Creating files with source text and conversion table

I used Irony to parse BNF grammar, ABNF grammar is well described in RFC5234. According to the syntax tree, the program builds the NKA according to the rules of Thomson. Standard algorithms were used to convert an NFA to a DFA and minimize it; I will not describe them either.

Since the HTTP message is a relatively simple object, the output of the parser is a class (this is enough), with fields corresponding to the HTTP headers. In order for the machine to parse the message during the passage, simple actions modifying the class variables are added to the necessary states in the DCA during generation. For example, you need to get the value of the Host header, in the class there are two variables, the beginning and the end of the header. In states that are triggered when the machine goes by the value of the Host field, actions are added to change these variables. In DKA itself, it is impossible to determine which state is responsible for what, in practice. Therefore, the states are marked when creating the NKA, and then these labels are saved, with all transformations of the automaton. To simplify marking states and assigning actions, a simple language was made, in which the names of the variables are set and what values ​​they will receive. Based on this data, a class is then generated into which the parser puts the result of the work.

A similar approach is used in regular expressions (although not all implementations do the conversion to DCA). But having direct access to the machine, it is possible to make a number of interesting optimizations. For example, the Content-Lenght field contains the length of the message body; for this field, conversion to number occurs directly in the machine; at the output, we have a field of the int ContentLenght class. Another interesting optimization, for example, a set of methods (GET, HOST ...) is defined for an HTTP request. At the output of the parser, you can immediately get the constant corresponding to the method. And all this in one pass.

It is also very important that the parser accepts an array of bytes as an input, this is especially true for modern languages ​​like C #, where a string is not an array of bytes as in C ++, and you can’t do with simple pointer casting here.

In general, the system is as follows. There is a BNF grammar, and a description of the class that we want to get on the output. All this is fed to the input of the generator, and we get the DFA in C #. With a ready-made machine, you can work as follows: In practice, a DCA for an HTTP request contains up to a minimum of 150K states, after 1K-2K. It is clear, firstly, that 1K-2K is quite acceptable for real use. Secondly, it’s 150K states, it requires some memory and time to calculate (at least in my implementation) - minutes x-tsat. The parser speed is very high, a formal rough estimate for this approach is ~ С * O (n), and it is probably difficult to fundamentally improve it. In reality, the parser in C # on my machine managed to parse a message of 220 bytes in length 1M in 3.3 seconds.

dfa.SetDefaultValue();
dfa.Parse(message0, 0, message0.Length);
dfa.SetArray(message0);

Console.WriteLine("Method: {0}", dfa.Method);
Console.WriteLine("Request-URI: |{0}|", dfa.RequestUri.ToString());
Console.WriteLine("Host: |{0}|", dfa.Host.ToString());
Console.WriteLine("Content-Type: |{0}|", dfa.ContentType.Value.ToString());
Console.WriteLine("Content-Length: {0}", dfa.ContentLength);
Console.WriteLine("Referer: |{0}|", dfa.Referer.ToString());






The source code of the project is available: code.google.com/p/siprevo I

tried to make the code at the output of the generator accurate, but the generator itself is slightly (if not to say more) in draft form. It’s good (just not during debugging) that such systems can hardly glitch slightly, it either works or does not work, in this case it works.

Update: the “compiler” of the machine in the form of a separate utility lives here .

Also popular now: