degs March 15, 2016 at 03:51

Parsing a Function from the D Standard Library


Hi Habr, I want to invite everyone on a short tour of the D language. Why? For the same reasons people go on any excursion: to have fun, to see something new, and simply because it is interesting. D can hardly be called new, or even young, but it has developed rapidly over the past few years. Andrei Alexandrescu joined the community and quickly became one of its leading developers; with his knack for anticipating trends, he has contributed a great deal to the concepts of the language itself and especially to the standard library.

Since its inception, D has positioned itself as an improved C++ (at least in my reading of its history). The ability to discard some obsolete constructs and introduce things that could not be implemented in classical C++, while carefully preserving low-level capabilities such as the built-in assembler, pointers and the use of C libraries, makes D a unique contender for the title of "next in the series C, C++, ...". That is my point of view, anyway. I myself am (it would probably be polite to add "unfortunately") completely monolingual: I have been writing C++ for many years, and every attempt to get acquainted with another language has inevitably ended in a good, healthy sleep. However, I have heard from followers of other faiths that D is interesting to them as a language too, so I invite everyone on the tour.

What will I show? Several very good books have already been written about D, so I decided simply to take the getopt() function from the standard library and look at its code, an invaluable exercise for bringing to life what I had read in the books. Why this particular function? It is familiar to everyone and platform-independent; I personally use it 3-4 times a week and can picture in detail how it might be written in 3 different languages. Besides, the author of the code is Alexandrescu. I have seen many teaching examples of his code in books but never code he wrote for production, so I was curious. In the end, of course, I could not resist and built my own bicycle (improved, naturally); in this case reinventing the wheel is entirely appropriate and no less useful than parsing someone else's code.

We will see far from everything worth seeing, and I am far from an expert myself, so whoever is interested should read further on their own; the links are at the end.

Exterior inspection

Here is some code that illustrates how to use the function:

import std.getopt;

void main(string[] args)
{
	// placeholders that receive the parsed values
	string file;
	bool quiet;
	enum Count { zero, one, two, three }
	Count count;
	int selector;
	int[] list;
	string[string] dict;
	// module-level variable: separator for array and map values
	std.getopt.arraySep=",";
	auto help=getopt(args
		, std.getopt.config.bundling
		, "q|quiet", "opposite of verbose", &quiet
		, "v|verbose", delegate{quiet=false;}
		, "o|output", &file
		, "on", delegate{selector=1;}
		, "off", delegate{selector=-1;}
		, std.getopt.config.required, "c|count", "counter", &count
		, "list", &list
		, "map", &dict
	);
	// -h or --help was given: print the generated usage text
	if(help.helpWanted)
	    defaultGetoptPrinter("Options:", help.options);
}

The first thing we see is "almost C". Then we notice dynamic arrays, string[] and int[], and an associative array, string[string]. Then a suspicious assignment, std.getopt.arraySep = ",". Is that really a global variable!? Have we wandered into the Kunstkamera, or what? That's right: dynamic and associative arrays are built into the language and form one of its foundations (I personally think of Perl at once, in the good sense of the word). And std.getopt.arraySep really is a global variable belonging to the module, and assigning to it is probably terrible from a purist's point of view, even in a function as specific as getopt(). However, things are not so clear-cut here: arraySep could just as well be defined as a pair of functions:

@property string arraySep() { return ... }
@property void arraySep(string separator) { .... }

and still look like a variable while meeting the strictest standards of data encapsulation. This is a signature feature of D: syntactic sugar taken to perfection, which gives the language its distinctive look. Moreover, the call could even look like

",".arraySep;

Does that seem far-fetched? What about this construct:

auto helloWorld="dlrowolleh".reverse.capitalize_at(0).capitalize_at(5).insert_at(5,' ');

This is of course a speculative example, just to show that such a syntax makes sense, but the construct is used in D as widely and as successfully as the pipe (the | sign) in bash scripts. It has a fine name of its own, Uniform Function Call Syntax, although in fact it is nothing more than syntactic sugar that lets you call fun(a, b, c) as a.fun(b, c).
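To make the two ideas above concrete, here is a minimal sketch of a property pair and of a UFCS chain. The names separator, exclaim and oddSquares are hypothetical helpers invented for this illustration; only iota, filter, map and array come from the standard library.

```d
import std.algorithm : filter, map;
import std.array : array;
import std.range : iota;

// Hypothetical module-level setting hidden behind a pair of property functions.
private string separator_ = ",";
@property string separator() { return separator_; }
@property void separator(string s) { separator_ = s; }

// An ordinary free function (hypothetical); UFCS lets us call it as s.exclaim.
string exclaim(string s) { return s ~ "!"; }

// A pipeline in the spirit of a shell pipe: squares of the odd numbers in [1, limit).
int[] oddSquares(int limit)
{
    return iota(1, limit)
        .filter!(x => x % 2 == 1)
        .map!(x => x * x)
        .array;
}

void main()
{
    separator = ";";                 // looks like an assignment, calls the setter
    assert(separator == ";");        // looks like a variable read, calls the getter

    assert(exclaim("hello") == "hello".exclaim);   // the same call, two spellings
    assert(oddSquares(6) == [1, 9, 25]);
}
```

Note that UFCS only finds functions visible at module scope, which is why the helpers are declared outside main.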
Next we see the function call itself, and the incredible flexibility of the interface immediately catches the eye: an arbitrary number of configuration parameters, including arbitrary handlers and descriptions, are passed directly to the function. The suspicion involuntarily creeps in that D is a dynamically typed language. Nothing of the kind: as we will see later, it is just well-honed template technique.
In general terms, an option is described by the following sequence:
[modifier,] option names, [description,] &handler
The most trivial part here is the option names, just a string of the form "f|foo|x|something-else" that defines the possible synonyms, both short and long. The description (a line of usage help) is also just a string, but it is optional, which already requires some work with types at compile time.
The real magic starts with the handler. It must be an address, but the address of almost anything, including an enum variable (at this point my inner C++ programmer wrinkled his forehead), as well as a function or a lambda (well, that part is simple, right?).
More details:

If the handler is a pointer to bool, the option takes no argument: -f or --foo writes true to the variable. However, you can also write --foo=true or --foo=false.
If the handler is a pointer to a string, a numeric type or an enum, an option with an argument is expected; the argument is converted to the target type and stored through the pointer.
One more variant: if the handler is a pointer to an integer type and the option name ends with '+', the variable is incremented every time the option appears on the command line.
If the handler is a pointer to an array, the option takes an argument that is converted to the element type and appended to the array; with arraySep set to ",", several comma-separated values can be given at once. After parsing the command line --foo=1,2,3,4,5 the corresponding array will be [1,2,3,4,5].
In the same way you can pass a pointer to an associative array; the argument must then be a list of key=value pairs, which are converted to the key and value types.

The function returns a result with two fields: a list of options that can be printed, and the boolean helpWanted, which is true if the -h or --help option (added to the list automatically) was present on the command line.
To complete the picture, each option can be preceded by a modifier, for example required or caseInsensitive. In addition, the module defines several global variables, such as optionChar (default '-'), endOfOptions (default "--") and arraySep (set to "," in the usage example above); assigning to them changes the command-line syntax.
As a result we get a universal and convenient function. It is obviously a template, and it is roughly clear how something similar could be implemented in C++, but how exactly is it done in D?
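The rules above can be tried out on a hand-made argument array instead of a real command line. The following is only a sketch: the Parsed struct and the parse function are hypothetical wrappers written for this illustration, while getopt, config.required and arraySep are the std.getopt names discussed in the text.

```d
import std.getopt;

enum Count { zero, one, two, three }

// Hypothetical container for the results, so that they can be inspected afterwards.
struct Parsed
{
    bool quiet;
    Count count;
    int[] list;
    string[string] dict;
    bool helpWanted;
    string[] rest;      // whatever getopt did not recognize as an option
}

Parsed parse(string[] args)
{
    Parsed p;
    arraySep = ",";     // allow comma-separated values for arrays and maps
    auto help = getopt(args,
        "q|quiet", "suppress output", &p.quiet,                   // bool: no argument
        config.required, "c|count", "counter", &p.count,          // enum: converted from text
        "list", &p.list,                                          // array: values appended
        "map", &p.dict);                                          // map: key=value pairs
    p.helpWanted = help.helpWanted;
    p.rest = args;      // recognized options have been removed from args
    return p;
}

void main()
{
    // args[0] plays the role of the program name.
    auto p = parse(["prog", "-q", "--count", "two",
                    "--list=1,2,3", "--map", "a=x,b=y", "input.txt"]);
    assert(p.quiet);
    assert(p.count == Count.two);
    assert(p.list == [1, 2, 3]);
    assert(p.dict == ["a": "x", "b": "y"]);
    assert(!p.helpWanted);
    assert(p.rest == ["prog", "input.txt"]);
}
```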

Open the hood

The first thing that attracts attention is how simple and natural it is to define template functions. The syntactic difference between ordinary and template functions is so slight that it changes your perception: you no longer write "ordinary" and "template" functions, just functions, some of whose parameters happen to be template parameters. Looking ahead, I will say that the opts arguments can be accessed like an array: opts[0], opts[$ - 1] or opts[2 .. 5].

GetoptResult getopt(T...)(ref string[] args, T opts)
{
    ...
    getoptImpl(args, cfg, rslt, opts);
    return rslt;
}

Actually, there is nothing more to say about the top-level function, because it immediately hands control to getoptImpl(), which we will look at now.

 1 private void getoptImpl(T...)(ref string[] args, ref configuration cfg, ref GetoptResult rslt, T opts)
 2 {
 5     static if(opts.length) {
 6         static if(is(typeof(opts[0]) : config)) {
 7             // it's a configuration flag, act on it
 8             setConfig(cfg, opts[0]);
 9             return getoptImpl(args, cfg, rslt, opts[1 .. $]);
10         } else {
11            // it's an option string
                ...
16             static if(is(typeof(opts[1]) : string)) {
17                auto receiver=opts[2];
18                 optionHelp.help=opts[1];
19                 immutable lowSliceIdx=3;
20             } else {
21                 auto receiver=opts[1];
22                 immutable lowSliceIdx=2;
23             }
                 ...
34             bool optWasHandled=handleOption(option, receiver, args, cfg, incremental);
41             return getoptImpl(args, cfg, rslt, opts[lowSliceIdx .. $]);
42         }
43     } else {
44         // no more options to look for, potentially some arguments left
            ...
68         }
75     }
76 }

As you can see from the numbers, I threw away quite a few lines, but the whole structure of the code is in plain view.
The first thing that catches the eye is the static if () {} else static if () {} else {} construct, and yes, it is exactly what you probably think it is. The branch of a static if is selected at compile time, so naturally the condition must also be known at compile time. Thus this code (which smells faintly of spaghetti for my picky taste) is cut down during compilation to the few lines that make sense for this particular set of function arguments. As I said before, template parameters can be handled like an immutable array: opts.length is 0 when the option list is empty, so static if (opts.length) selects the else branch, and the code from line 43 takes the place of a template specialization for that case.
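The same skeleton, a variadic template that peels off its first argument and recurses on the rest, can be shown in isolation. This is a minimal sketch; countStrings is a hypothetical function written for this illustration and has nothing to do with std.getopt itself.

```d
// Hypothetical example: counts how many of the arguments are strings.
// Every branch decision below is made by the compiler, per instantiation.
size_t countStrings(T...)(T opts)
{
    static if (opts.length == 0)
        return 0;                                  // plays the role of a specialization for "no arguments"
    else static if (is(typeof(opts[0]) : string))
        return 1 + countStrings(opts[1 .. $]);     // first argument is a string
    else
        return countStrings(opts[1 .. $]);         // anything else is skipped
}

void main()
{
    assert(countStrings() == 0);
    assert(countStrings("a", 1, "b", 2.5) == 2);
    assert(countStrings(1, 2, 3) == 0);
}
```

For the call countStrings("a", 1, "b", 2.5) the compiler generates five instantiations, one per remaining argument list, and each of them contains only the single return statement that applies to it.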
Another interesting point: the braces after static if () do not introduce a new scope. Take a look:

16             static if() {
19                 immutable lowSliceIdx=3;
20             } else {
22                 immutable lowSliceIdx=2;
23             }
41             return getoptImpl(args, cfg, rslt, opts[lowSliceIdx .. $]);

The lowSliceIdx variable is defined inside one of the blocks but used outside them, which is very logical in my opinion. Since the variable is declared immutable (roughly constexpr in C++ terms) and initialized with a constant, it is also available at compile time and can be used in templates.
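A small sketch of the scoping rule, assuming nothing beyond the core language: describe is a hypothetical function, and the constant kind is declared in whichever static if branch the compiler selects, yet it stays visible after the blocks.

```d
// Hypothetical example of a declaration that "escapes" a static if block.
string describe(T)(T value)
{
    static if (is(T : long))
        enum kind = "integer";      // declared here...
    else static if (is(T : real))
        enum kind = "floating";     // ...or here...
    else
        enum kind = "other";        // ...or here,
    return kind ~ ":" ~ T.stringof; // and used outside all three branches
}

void main()
{
    assert(describe(1) == "integer:int");
    assert(describe(2.5) == "floating:double");
    assert(describe("x") == "other:string");
}
```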
Let's look deeper, at the place where the analysis of options and the actual work with types begins:

 6         static if( is(typeof(opts[0]) : config)) {
 7             // it's a configuration flag, act on it
 8             setConfig(cfg, opts[0]);
 9             return getoptImpl(args, cfg, rslt, opts[1 .. $]);
10         } else {
               ......
42         }

Ohhh, there it is! D has the typeof(expr) that C++ waited so long for, and it works exactly as intended. But that is not all. The expression is(T == U) is true, at compile time naturally, if and only if the types T and U are equal, and with template parameters and its other forms, is() turns into a Swiss Army knife for working with types. Generally speaking, is() is an inline SFINAE: it is true if and only if its argument is a valid type, that is, the expression inside it compiles. For example, is(arg == U[], U) checks that arg is an array type, and is(arg : int) checks that arg converts implicitly to int; the colon unobtrusively hints at inheritance. More examples follow later. Thus the expression on line 6 statically checks whether the type of the first parameter, typeof(opts[0]), converts to the type config. And config is simply an enumeration of all possible option modifiers:

enum config {
    /// Turns case sensitivity on
    caseSensitive,
    /// Turns case sensitivity off
    caseInsensitive,
    /// Turns bundling on
    bundling,
    /// Turns bundling off
    noBundling,
    /// Pass unrecognized arguments through
    passThrough,
    /// Signal unrecognized arguments as errors
    noPassThrough,
    /// Stop at first argument that does not look like an option
    stopOnFirstNonOption,
    /// Do not erase the endOfOptions separator from args
    keepEndOfOptions,
    /// Makes the next option a required option
    required
}

after which getoptImpl() saves the value of the modifier (saving a value means runtime) and recursively calls itself with the first argument removed from the options (opts[1 .. $]). So we have worked through the first case of type handling, and it turned out to be surprisingly simple. You just throw the endless compile time versus runtime distinction out of your head and read the code as written, and when you meet typeof(T), you look a couple of pages up to where that type is defined (in our case, in the list of actual parameters of getopt()). It is almost insulting how simple it is; in C++ this looks much more like magic. Or maybe that was the intention all along? After all, the compiler has exactly the same information as I do: the source code.
Then, recursively pulling elements one by one from the input list, the compiler reaches the first string parameter, which must be the list of names for the option (line 11). Here the variants begin, and again they are resolved very easily: if the second (next) parameter is a string, it is the description, and the third is the address of the handler; otherwise (not a string) it is the handler itself. Accordingly, we pull either three or two parameters from the list and pass them to the next function, handleOption(), which does the actual parsing of the command line; after that getoptImpl() naturally calls itself recursively and everything starts over.
Nothing new happens after that compared to what we have already seen. The function handleOption(), a template with a single parameter, the type of the handler, runs along the entire command line checking whether an argument matches the description and, if it finds one, performs the action appropriate to its handler. I will briefly go over the points I find most interesting.
First, a general view from above:
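The forms of is() mentioned above can be checked directly with static assert. This is a sketch for experimenting, not code from the library: isSlice and Modifier are hypothetical names introduced here, with Modifier standing in for a cut-down config enumeration.

```d
// Every check below is evaluated by the compiler; main has nothing to do.
static assert( is(int == int));              // exact type equality
static assert(!is(int == long));
static assert( is(int : long));              // implicit conversion exists
static assert(!is(string : int));            // ...and here it does not

// Pattern matching on the shape of a type: "T is U[] for some U".
enum isSlice(T) = is(T == U[], U);
static assert( isSlice!(int[]));
static assert( isSlice!string);              // string is immutable(char)[]
static assert(!isSlice!int);

// typeof yields the type of an expression; if the expression does not
// compile there is no type, and is() is simply false.
static assert( is(typeof(1 + 2.5) == double));
static assert(!is(typeof(1 + "a")));

// The check from line 6, on a hypothetical two-value enumeration.
enum Modifier { required, bundling }
static assert( is(typeof(Modifier.bundling) : Modifier));
static assert(!is(typeof("q|quiet") : Modifier));

void main() {}
```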

static if(is(typeof(*receiver) == bool)) {
    *receiver=true;
} else {
// non-boolean option, which might include an argument
    static if(is(typeof(*receiver) == enum)) {
        *receiver=to!(typeof(*receiver))(val);
    } else static if(is(typeof(*receiver) : real)) {
        *receiver=to!(typeof(*receiver))(val);
    } else static if(is(typeof(*receiver) == string)) {
        *receiver=to!(typeof(*receiver))(val);
    } else static if(is(typeof(receiver) == delegate) || is(typeof(*receiver) == function)) {
        // functor with two, one or no parameters
        static if(is(typeof(receiver("", "")) : void)) {
            receiver(option, val);
        } else static if(is(typeof(receiver("")) : void)) {
            receiver(option);
        } else {
            static assert(is(typeof(receiver()) : void));
            receiver();
        }
    } else static if(isArray!(typeof(*receiver))) {
        foreach (elem; ...)
            *receiver ~= elem;
    } else static if(isAssociativeArray!(typeof(*receiver))) {
        foreach (k, v; ...)
            (*receiver)[k]=v;
    } else {
        static assert(false, "Dunno how to deal with type " ~ typeof(receiver).stringof);
    }
}

The recurring construct

static if(is(typeof(*receiver) == ...)) {
    *receiver=to!(typeof(*receiver))(val);

actually means "if a pointer to something is passed as the handler, try to convert the argument to that type and store it through the pointer."
Handled separately are pointers to bool, which may have no argument; arrays and associative arrays, where the argument is added to the container; and functions and lambdas, which can take two, one or no arguments. Note the inner selector on the function type:

        static if(is(typeof(receiver("", "")) : void)) {
            receiver(option, val);
        } else static if(is(typeof(receiver("")) : void)) {
            receiver(option);
        } else {
            static assert(is(typeof(receiver()) : void));
            receiver();
        }

This is another use of the expression is(T): it can be true only if T is an existing type. In this particular case it looks at the type returned by receiver(), receiver("") or receiver("", ""); if the function has such a signature, the type exists too, otherwise SFINAE kicks in. (void is a full-fledged type.)
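The selector can be exercised on its own. In the sketch below, invoke is a hypothetical helper modeled on the fragment above (it is not a std.getopt function); it calls the receiver with as many string arguments as the receiver accepts and reports which branch was compiled in.

```d
// Hypothetical helper: returns the number of arguments the receiver was called with.
int invoke(R)(R receiver, string option, string value)
{
    static if (is(typeof(receiver("", "")) : void))
    {
        receiver(option, value);    // two-argument form compiles
        return 2;
    }
    else static if (is(typeof(receiver("")) : void))
    {
        receiver(option);           // only the one-argument form compiles
        return 1;
    }
    else
    {
        // Last chance: if this does not compile either, stop with a readable message.
        static assert(is(typeof(receiver()) : void),
            "receiver must take two, one or no string arguments");
        receiver();
        return 0;
    }
}

void main()
{
    string log;
    assert(invoke((string o, string v) { log ~= o ~ "=" ~ v; }, "map", "a") == 2);
    assert(invoke((string o) { log ~= "," ~ o; }, "on", "") == 1);
    assert(invoke(() { log ~= ",plain"; }, "off", "") == 0);
    assert(log == "map=a,on,plain");
}
```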
It is also useful to get acquainted with D's universal converter from the std.conv module, to!(T)(). It works like boost::lexical_cast but, unlike it, can even convert a string to an enum, since D shamelessly uses all the information available at compile time, as we see in the code above.
That's all: in about 400 meaningful lines of code a rather complicated function is implemented, with a result that is very difficult, if possible at all, to reproduce in C++. We, in turn, have become acquainted with some of D's facilities for working with types: template functions with a variable number of arguments, selection of a type and a code branch at compile time, and type conversion. In fact this is only a small part of the tools D offers developers; the site has a huge collection of articles on a variety of topics. I am not urging anyone to switch to D or to learn D, but if you still have a spark of curiosity and interest in new things, this is certainly a language worth getting to know at least superficially.
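A few conversions with std.conv.to, as a minimal sketch. The Count enumeration repeats the one from the usage example at the top; to, ConvException and assertThrown are standard library names.

```d
import std.conv : to, ConvException;
import std.exception : assertThrown;

enum Count { zero, one, two, three }

void main()
{
    // Text to values, much like lexical_cast...
    assert("42".to!int == 42);
    assert("2.5".to!double == 2.5);

    // ...and text to enum members by name, in both directions.
    assert("two".to!Count == Count.two);
    assert(Count.three.to!string == "three");

    // A name that is not a member is reported with an exception.
    assertThrown!ConvException("four".to!Count);
}
```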

Critique of Pure Reason

However, I cannot hold back some criticism: there is something I definitely do not like in this implementation. By and large it has nothing to do with the language itself, but it is interesting to discuss from a general programming perspective.
Firstly, the implementation is single-pass: an option is taken from the list, a pass over the command line is made at once, and the first match found breaks the loop. This means that you cannot write -qqq as a synonym for "quieter, quieter, quieter still", or --map A=1 --map B=2 --map C=3 instead of --map A=1,B=2,C=3. Strictly speaking this is not a bug, but it violates some established conventions for using getopt(), and I would like to see more traditional behavior.
Secondly, and this in my opinion is a serious architectural error, the function returns a structure with the usage help, which is normally printed in response to the -h|--help switch, but the same function throws an exception in case of an error. That is, if you made a mistake on the command line, the program can no longer tell you how to do it correctly. Generally speaking, this follows from the same single-pass implementation.

UPD: Does Alexandrescu read Habr?

In the latest commit this has been fixed, not quite the way I would have done it, but nonetheless.

In addition, there are a few minor flaws. For example, an option can have as many synonyms as you like, but only the first two make it into the usage help: in the option "x|abscissa|initialX", the last name can be found only by looking into the code. And other annoying little things of that kind.
Therefore I wrote my own implementation as an exercise, where I fixed these shortcomings and added various bells and whistles of my own (purely as an exercise); in general, I had as much fun as I wanted.

Here was my bike! Where is my bike?

But no, the bicycle now lives there; it happens. I decided that a good guide should know when to stop, so this is the end of the tour.
I hope it was interesting.

Bibliography

I read the first three books, so I list each of them under its own number. They are all good, but none is ideal for my taste, so I read them as a sandwich: a chapter from one, then the corresponding chapters from the other two.
