More about parsing on Prolog

    Here then came across, in general, it is a simple task consists in parsing a text file containing 5 million float'ov (and the calculation of their sum). The file is generated by the following C # code:
    static void Main(string[] args)
    {
      using (Stream stm = new FileStream(@"d:\numbers_large.txt", FileMode.Create))
      {
        TextWriter wr = new StreamWriter(stm);
        System.Random r = new System.Random();
        for (int i = 0; i < 5000000; i++)
        {
          double d=10000*r.NextDouble() * (r.NextDouble() > 0.7 ? -1.0 : 1.0);
          wr.Write("{0} ", d);
        }
        wr.Flush();
      }


    The task was posed in the context of a discussion of haskell's performance in applying it to parsing tasks. I knew that at the prologue, such tasks are solved beautifully and at ease using the DCG technique (Definite clause grammar: 1 , 2 , 3 , 4 ). In fact, this is a description of grammars in the Prolog language, and parsing based on them, based on the exhaustive search principle of the prolog.

    Well, that is, it usually turns out very briefly and beautifully (for example, here is a solution to the problem of balancing the brackets with this method: a program of 7 lines ), but I suspected that it was not always fast. Actually, I wanted to check it out.

    Pretty quickly (about 15 minutes) an obvious (naive) version of the program was written:


    :- set_prolog_flag(float_format,'%.15g').

    integer(I) -->
            digit(D0),
            digits(D),
            { number_chars(I, [D0|D])
            }.

    digits([D|T]) -->
            digit(D), !,
            digits(T).
    digits([]) -->
            [].

    digit(D) -->
            [D],
            { code_type(D, digit)
            }.


    float(F) -->
        ( "-", {Sign= -1}
        ; "", {Sign= 1}
        ), !,
        integer(N),
        ",",
        integer(D),
        {FisSign * (N + D / 10^(ceiling(log10(D))))
        }.


    sum(S) -->
        float(F1), !,
        " ",
        ( sum(S1)
        ; {S1= 0}
        ),
        { SisF1 + S1}.


    go :-
        read_file_to_codes('numbers_large.txt',Codes,[]),
        writeln('Processing...'),
        sum(S,Codes,[]),
        writeln(S).


    Of course, I suspected a breakdown on the parsing of such a large file, but a joke happened earlier. The read_file_to_codes predicate fell into the Out of global stack. This, in general, was pretty obvious. The fact is that in my implementation of the SWI prologue there is a fundamental limitation on the 32-bit architecture - it cannot use more than 128 mb per global stack and 128 mb per local one (the restrictions have been removed in 64 bit axes). I later read it in the dock, and before that I tried to play with the -G500M-L500M keys, but this, of course, did not save, because reading ~ 82M characters in memory required significantly more space (than 128).

    The guys from the mailing list helped (by the way, I want to emphasize that the author of SWI Prolog, Jan Wielemaker, is actively involved in the mailing list, and this time he helped!).

    They suggested using the phrase_from_file predicate from library (pio), which works with lazy-style streams and actually combines reading from a file and parsing in a single predicate. Just what you need!

    Moreover, anticipating further problems, he rewrote the predicate sum, which performs parsing with simultaneous summation, in the 'tail recursion' form. For the uninitiated, I’ll tell you a little more in detail: in functional and logical programming languages ​​you usually can’t create loops, so almost everything is done using recursion. But if you write it correctly (when possible), then you can achieve the effect that the code is compiled into a loop, although outwardly it will be recursion. On the prolog for writing tail recursion, the predicate must be written as

    head :-
      , !,
      ,
      head.
    head :-
      ...


    Actually, the main thing in tail recursion is that at the very end of the predicate there is a call of its own.

    The code was rewritten as follows:

    :- set_prolog_flag(float_format,'%.15g').

    integer(I) -->
            digit(D0),
            digits(D),
            { number_chars(I, [D0|D])
            }.

    digits([D|T]) -->
            digit(D), !,
            digits(T).
    digits([]) -->
            [].

    digit(D) -->
            [D],
            { code_type(D, digit)
            }.


    float(F) -->
        ( "-", {Sign=3D -1}
        ; "", {Sign=3D 1}
        ),
        integer(N),
        ",",
        integer(D),
        {FisSign * (N + D / 10^(ceiling(log10(D))))
        }.

    sum(S, Total) -->
        float(F1), !,
        " ",
        { S1isS + F1},
        sum(S1, Total),
        !.
    sum(Total, Total) -->
        [].

    go1 :-
        phrase_from_file(sum(0, S),'numbers_large.txt', [buffer_size(16384)]),
        writeln(S).


    However, a bummer again, this time with local stack (and memory consumption of about 130 mb). This could only mean that tail recursion did not work, and this time it turned out to be guilty prudently (here, indeed, premature optimization is the root of all ills) put at the end of the predicate sum / 2 clipping icons (cut,!). It was he who, called to cut off the search for unnecessary alternatives, being put at the end, turned off the optimization of tail recursion as well. In general, removing this ill-fated (!) From there, but adding it to the float after an alternative to the parsing of the sign (in order to avoid remembering the return points), I got the final version of the program:

    ?- go1.
    ERROR: Out of local stack
    Exception: (1,973,549) sum(3943621517.84343, _G2, [45, 50, 55, 54, 57,
    44, 49, 48|...], []) ?







    :- set_prolog_flag(float_format,'%.15g').

    integer(I) -->
            digit(D0),
            digits(D),
            { number_chars(I, [D0|D])
            }.

    digits([D|T]) -->
            digit(D), !,
            digits(T).
    digits([]) -->
            [].

    digit(D) -->
            [D],
            { code_type(D, digit)
            }.


    float(F) -->
        ( "-", {Sign= -1}
        ; "", {Sign= 1}
        ), !,
        integer(N),
        ",",
        integer(D),
        {FisSign * (N + D / 10^(ceiling(log10(D))))
        }.

    sum(S, Total) -->
        float(F1), !,
        " ",
        { S1isS + F1},
        sum(S1, Total).
    sum(Total, Total) -->
        [].

    go1 :-
        phrase_from_file(sum(0, S),'numbers_large.txt', [buffer_size(16384)]),
        writeln(S).


    On Core2Duo @ 3Ghz, ~ 1min works, consuming no more than 5 mb of memory and developing a parsing speed of ~ 1.3 Mb / s. Well, it’s not so fast as an option on Haskell, but, IMHO, it’s very good for the interpreted SWI-Prolog, and it’s very elegant, compared to the solution from the root post of RSDN).

    By the way, python was not surprisingly pleased with the speed. Naive, true, option
    from decimal import *

    getcontext().prec=15

    print sum(Decimal(f.replace(',','.') or '0') for f in open('numbers_large.txt').read().split(' '))

    * This source code was highlighted with Source Code Highlighter.


    I ate 236 megabytes of memory and worked for ~ 4 min on the same configuration.

    Also popular now: