Little about lexical analysis

    Long time ago, when the sky was blue, the grass was greener and dinosaurs roamed the Earth ... No, forget about dinosaurs. Well, in general, once the thought occurred to take a break from standard web programming and do something more insane. It was possible, of course, anything, but the choice fell on the writing of his interpreter. What can I say ... Never write your programming languages . But I learned some experience from all this, so I decided to share it. Let's start with the very basics - lexer.


    Before you begin to understand what kind of animal a “lexer” is, it is worth figuring out what a PL consists of.

    In the modern world, each compiler / interpreter / transpiler / something-there-yet-like (let's, I just call it “the compiler”, without any distinctions on types) is divided into two pieces. In the terminology of smart uncles, such pieces are called “frontend” and “backend”. No, it’s not at all working with the web that we used to call and the front is not written in JS with HTML. Although ... Okay.

    The task of the first, frontend - to take the text and turn it into AST(abstract syntax tree), on the way, checking the correctness of the syntax (and sometimes semantics). The task of the second backend is to make it all work. If the code is collected inside the interpreter, then a set of commands for the virtual processor (virtual machine) is created from AST, if the compiler, then a set of commands for the real processor. In life, everything is quite complicated and can be realized not quite. For example, in the case of the GCC compiler, everything is mixed there, but Clang is more canonical, LLVM is a typical representative of the backend for compilers.

    Now let's take a look at a piece called the "frontend."

    Lexical analysis

    The task of the lexer and the stage of lexical analysis is to get a lot of letters to the input and group them into some categories - “tokens”. Therefore, lexical analysis is also called “tokenization”. This is the very first stage of text processing that each existing compiler produces.

    Like this:

    $tokens = ['class', '\w+', '}', '{'];
    var_dump(lex('class Example {}', $tokens));
    // array(4) {//  [0] => string(5) "class"//  [1] => string(7) "Example"//  [2] => string(1) "{"//  [3] => string(1) "}"// }

    By the way, here we have already written a lot of tools to make life easier. The same preg functions that we used to use to parse text cope well with this task. However, there are more convenient tools for this business:

    • Phlexy , written by Nikita Popov.
    • Hoa is a toolkit consisting of Lexer + Parser + Grammar.
    • The port of Yacc , written by Anthony Ferrara, which is also a comprehensive toolkit, and on which is written the well-known PHP parser Popov, used in tools that use code analysis.
    • Railt Lexer is my implementation for PHP 7.1+
    • Parle is an extension for PHP that allows a limited set of PCRE expressions (no lookahead and some other syntactic constructs).
    • And finally, the standard function php token_get_all , which is designed directly for lexical analysis of PHP.

    Okay, it's clear that things that can divide text into tokens are plenty, maybe I even forgot something, like Doctrine lexer . But what next?

    Types of lexers

    And as always, things are not as simple as they seemed. There are at least two different categories of lexers. There is the usual option, quite trivial, which you slip the rules, and he already divides everything into tokens. The configuration of this is not much different from the example shown by me above. However, there is another option called multistate . Such lexers are a bit more difficult to understand, so I want to talk about them in a little more detail.

    The multistate lexer task is to deduce different tokens depending on the previous state. Well, for example, in PHP such “transitional” states are formed with the help of <? Php +?> Tags, inside lines, comments and HEREDOC / NOWDOC constructs.

    Remember the previous example with 4 tokens above? Let's modify it a bit to understand what these states are:

        // class Example {}

    In this case, if we have the simplest lexer without the wide capabilities of PCRE, then we get the following set of tokens:

    // array(9) {//  [0] => string(5) "class"//  [1] => string(7) "Example"//  [2] => string(1) "{"//  [3] => string(2) "//"//  [4] => string(5) "class"//  [5] => string(7) "Example"//  [6] => string(1) "{"//  [7] => string(1) "}"//  [8] => string(1) "}"//}

    As you can see, we got a completely banal jamb on elements 3-5: The comment was perceived quite unexpectedly and itself was divided into tokens, although it had to be considered as a solid piece.

    Of course, if there was a PCRE functional, such a token could be ripped out with the help of a simple regular " // [^ \ n] * \ n ", but if it is not? Well, or do we want to gash with our hands? In short, in the case of the multistate lexer, it can be said that all tokens must be in group No1 , as soon as the token " // " is found , then a transition to group No2 should occur . And inside the second group there is a reverse transition if the token " \ n " is found - the transition back to the first group.

    Something like this:

    $tokens = [
        'group-1' => [
            '//' => 'group-2'// Переход после слешей в группу 2
        'group-2' => [
            "\n" => 'group-1', // Переход в группу 1 после переноса строки'.*'

    I think now it becomes clearer how some HEREDOC is parsed, because even with all the power of PCRE, writing a regular schedule for this business is extremely problematic, given that this HEREDOC syntax supports interpolation of variables. Just try parsing something with the token_get_all built-in function (note the> 12 token):

    $example = 42;
    $a = <<<EOL
        Your answer is $example !!!

    Okay, it seems we are ready to start practicing.


    Let's remember what we have in PHP for such cases? Well, of course, preg_match! Okay, come down. The preg_match based algorithm is implemented in Hoa and here in this Phelxy implementation . Its task is quite simple:

    1. We have on hand the source text and an array of regularies.
    2. Match until something is right.
    3. Once found a piece, cut it out of the text and match it further.

    As code, it will look something like this:

    Code sheet
         * @var array<string>
         */private $tokens = [];
         * @param array<string> $tokens
         */publicfunction__construct(array $tokens){
            foreach ($tokens as $name => $definition) {
                $this->tokens[$name] = \sprintf('/\G%s/isSum', $definition);
         * @param string $sources
         * @return iterable&\Traversable<string>
         * @throws RuntimeException
         */publicfunctionlex(string $sources): iterable{
            [$offset, $length] = [0, \strlen($sources)];
            while ($offset < $length) {
                [$name, $token] = $this->next($sources, $offset);
                yield $name => $token;
                $offset += \strlen($token);
         * @param string $sources
         * @param int $offset
         * @return array<string,string>
         * @throws RuntimeException
         */privatefunctionnext(string $sources, int $offset): array{
            foreach ($this->tokens as $name => $pcre) {
                \preg_match($pcre, $sources, $matches, 0, $offset);
                $token = \reset($matches);
                if (\count($matches) && \strpos($sources, $token, $offset) === $offset) {
                    return [$name, $token];
            thrownew \RuntimeException('Unrecognized token at offset ' . $offset);

    Using Code Sheets
    $lexer = new SimpleLexer([
        'T_CLASS'       => 'class',
        'T_CONST'       => '\w+',
        'T_BRACE_OPEN'  => '{',
        'T_BRACE_CLOSE' => '}',
        'T_WHITESPACE'  => '\s+',
    echo \sprintf('| %-10s | %-20s |', 'VALUE', 'NAME') . "\n";
    foreach ($lexer->lex('class Example {}') as $name => $token) {
        echo \sprintf('| %-10s | %-20s |', '"' . \trim($token, "\n") . '"', $name) . "\n";

    Such an approach is rather trivial and allows a pair of pokes on the keyboard to modify the lexer in the area of ​​the next () method, adding a transition between states and turning this section of hand-hopping into a primitive multistate lexer. In the $ this-> tokens area, it’s enough to add something like $ this-> tokens [$ this-> state] .

    However, besides primitivism itself, there is one more drawback, not fatal as it could turn out to be, but still ... Such a realization works incredibly slowly. On i7 7600k, the owner of which I by chance turned out to be - a similar algorithm processes about 400 tokens per second, and with an increase in their variations (i.e. definitions that we passed to the designer) - it can slow down to the speed of changing presidents in the Russian Federation ... ahem , sorry. I wanted to say, of course, that it would work very slowly .

    Okay, what can we do? For a start, you can understand what is going wrong. The fact is that every time we call preg_matchinside the jungle of a language, a compiler with its own JIT, called PCRE, rises (And with PHP 7.3 already PCRE2). Each time it parses the regulars itself and builds a parser for them, with which we parse the text to create tokens. It sounds a bit strange and tautological. But in short, each token requires a compilation of 1 to N regulars, where N is the number of definitions of these tokens. In this case, it is worth noting that even the applied flag " S " and optimization with the help of " \ G " in the constructor, where regular expressions for tokens are formed, do not help.

    One way out of this situation suggests itself - it is necessary to parse all this text in one pass, i.e. by executing just one preg_match function . It remains to solve two problems:

    1. How to indicate that the result of the regular expression N1 corresponds to the N2 token? Those. how to denote that " \ w + ", for example, is exactly T_CONST .
    2. How to determine the sequence of tokens as a result. As you know, the result of preg_match or preg_match_all will contain everything mixed up. And even with the help of flags passed as the fourth argument, the situation will not change.

    Here you can pause and think a little. Well or not.

    The solution to the first problem is PCRE named groups , which are also referred to as “submasks”. With the help of the rules: " (? <T_WHITESPACE> \ s + | <T_WORD> \ w + | ...) " you can get all the tokens in one pass by matching them with their names. As a result of the match, an associative array consisting of pairs " [TOKEN_NAME => TOKEN_VALUE] " will be formed .

    With the second bit more complicated. But here you can use tactical trick and use the function preg_replace_callback . Its peculiarity is that the anonymous text passed as the second argument will be called strictly sequentially, for each token, from first to last.

    In order not to torment - the implementation is as follows:

    Another code footcloth
         * @var array<string>
         */private $tokens = [];
         * @param array<string> $tokens
         */publicfunction__construct(array $tokens){
            foreach ($tokens as $name => $definition) {
                $this->tokens[] = \sprintf('(?<%s>%s)', $name, $definition);
         * @param string $sources
         * @return iterable&\Traversable<string,string>
         */publicfunctionlex(string $sources): iterable{
            $result = [];
            \preg_replace_callback($this->compilePcre(), function(array $matches)use(&$result){
                foreach (\array_reverse($matches) as $name => $value) {
                    if (\is_string($name) && $value !== '') {
                        $result[] = [$name, $value];
            }, $sources);
            foreach ($result as [$name, $value]) {
                yield $name => $value;
         * @return string
         */privatefunctioncompilePcre(): string{
            return \sprintf('/\G(?:%s)/isSum', \implode('|', $this->tokens));

    And its use is no different from the previous version. In this case, the speed of work increases from 400 to 57,000 tokens per second. It is this algorithm that I applied in my implementation , taking up the rewriting of Hoa source codes. By the way, if you use Parle, you can squeeze up to 600,000 tokens per second. And the overall picture looks something like this (with XDebug enabled in PHP 7.1, so the numbers are lower here, but the ratio can be approximated).

    • Yellow is the native extension of Parle.
    • Blue - implementation through preg_replace_callback with pre-assembled regular schedule .
    • Red is the same, but with the generated regular during the preg_replace_callback call .
    • Green - implementation via preg_match .

    What for?

    Well, all this, of course, is wonderful, but impatient people are eager to ask the question: "Who needs this at all?" In the abstract world of PHP, where the principle of “figak-figak-and-site-ready” prevails - such libraries are not needed, let's be honest. But if we talk about the ecosystem as a whole, then we can recall the well-known libraries, like symfony / yaml or Doctrine . Annotations in symfony are the same sublanguage inside PHP, requiring separate lexical and syntactic parsing. In addition, there are slightly less well-known TranspScript transporters, Less and Scss / Sass, written in PHP. Well or Yay and preprocessbased on it. I will not even mention code analysis tools like phpmd or phpcs. And documentation generators like phpDocumentnor or Sami are pretty trivial. Each of these projects, to one degree or another, uses lexical analysis at the first stage of text / code parsing.

    This is not a complete list of projects and maybe, I hope, my story will help you discover something new and replenish it.


    Looking ahead, if there is anyone interested in the topic of parsers and compilers, then there are some interesting reports on this topic, in particular from the guys from JetBrains:


    Also, of course, most of the speeches of Andrei Breslav ( abreslav ), which can be found on YouTube, I advise you to watch.

    Well, for lovers of reading, there is such a resource that was personally extremely useful to me.

    Post Post Scriptum. If somewhere it was sealed in the vastness of this epic, then you can safely inform the author in any form convenient for you.

    As a bonus, I would like to give an example of a simple PHP lexer , it seems not so scary now, and now it’s even clear what it does, right? Although whom I am deceiving, from the regulars the eyes bleed. =)

    Thank you!

    Also popular now: