Automatic program generation, inverse problem and some related solutions

    Hello, dear readers. This article will discuss one approach to automatically generating programs using the block model of the problem, solving an inverse problem (restoring the model of the original problem using an already generated program), and also solving the problem of verifying generated programs. The topic itself is very serious, but if possible, I will try to make the article popular (without a heavy review of analogs, a strictly theoretical part and other difficulties), with examples and descriptions of various applications.

    1. Automatic program generation


    To generate any program, you first need to know what problem we are solving. Having a clear formalized formulation of the problem, it is already possible to build certain rules for expanding this formulation into the problem solving plan and the rules for transforming the solution plan into the final code in a generalized or concrete algorithmic language. In this case, an approach is usually used to constructing the final program from ready-made, pre-created blocks — typical algorithms that are associated with a certain calling / serving code.

    Let the initial formulation of the problem to be solved be represented as a block diagram (blocks can be elementary or, in turn, be subcircuits), for example, networks of objects, each of which belongs to a certain class from the hierarchy of concept classes of the subject domain. Such a statement will be quite clear and understandable both to the person and to the program generating system. The system should convert this formulation into a plan for solving the problem according to some logical rules. Therefore, it would be appropriate to write the rules for such a transformation in any logical programming language, I chose GNU Prolog. In practice, I note, you can often "skip" this first phase (high-level description of the problem), immediately constructing a description of the solution plan, also in the form of a block network scheme.

    The resulting solution plan should be transformed into a program, as I already wrote, which is a code that connects typical algorithms for solving "building blocks" tasks, implemented, for example, as a library of functions. There are various ways possible, I will describe in a few words the approach I used. The generation process is represented as a sequence of events (for example, generation of declarations, generation of the initialization part, stages of the main part, deinitialization), at each of which the network of objects must be cascade (referring to the network cascades, such a scheme eliminates looping the model during event handling) corresponding code snippets. In this case, objects are based on the event identifier, the values ​​of their parameters (set by the user when constructing the block statement of the problem), as well as on the data which they can transmit to each other through the connections between them. Therefore, each object from the solution plan has methods for responding to such events, which generate code (and, in general, arbitrary text). To solve this problem, I chose the language of PHP - it can perfectly generate arbitrary text fragments.

    Setting up the system on the subject area is done by developing the corresponding hierarchy of the generating classes.

    An example of a generating script that generates code for the class “Vector processing (minimum, maximum, arithmetic average)”:

    <?if ($Stage == stInit) {
          $argumentID = $Arg["ID"][0];
          $argumentSize = $Arg["Size"][0];
          $resultID = GetNextMail("result" . $this->ID);
          if ($this->Op == "Min") {
             echo"  " . $resultID . " = 1E300;\n";
          } elseif ($this->Op == "Max") {
             echo"  " . $resultID . " = -1E300;\n";
          } elseif ($this->Op == "Avr") {
             echo"  " . $resultID . " = 0.0;\n";
          }
    ?>for (i = 0; i < <?echo $argumentSize; ?>; i++)
    <?if ($this->Op == "Min") {
    ?>if (<?echo $argumentID . "[i] < " . $resultID; ?>)
           <?echo $resultID . " = " . $argumentID . "[i];\n";
          } elseif ($this->Op == "Max") {
    ?>if (<?echo $argumentID . "[i] > " . $resultID; ?>)
           <?echo $resultID . " = " . $argumentID . "[i];\n";
          } elseif ($this->Op == "Avr") {
             echo"    " . $resultID . " += " . $argumentID . "[i];\n";
          }
          if ($this->Op == "Avr") {
             echo"  " . $resultID . " /= " . $argumentSize . ";\n";
          }
       }
    ?>

    This is a relatively simple scheme that works. A system that implements such a generation scheme (PGEN ++) allows, for example, to generate solvers in the following areas:

    a) modeling the propagation of pollution;
    b) solving some problems of kinematics of simple robot manipulators;
    c) simple training programs for working with vector data (search for minimum, maximum, arithmetic average);
    d) development of tests for the system of psychological testing PROFTEST.

    Here is an example of the initial formulation of the problem (a simple vector data processing program) in the form of a block diagram:



    And this is the result of generating the program :



    2. The inverse problem: reconstruction of the problem statement (initial model) by the generated program. Verification generated programs


    First of all, why is a solution of such a task necessary? Its direct, and, I think, the most obvious application is the verification of automatically generated programs. If we had model A, built a program P for it, from which model B was reconstructed, then the program is most likely correct if models A and B coincide.

    The following simple solution is proposed. In the automatic generation of the program, generating objects related to a certain hierarchy of classes of the subject area appeared. They had only generating methods. Add to them the recognition methods. Then the source program (or, in general, any text) is sequentially scanned by the recognition methods of each class, which, upon successful recognition, generates an object / objects of this class. We get an ordered (in the order of “reading” the text) a set of parameterized model elements. It remains to determine the relationship between them based on the order of objects and the relationship between their parameters. To do this, you can apply special decision rules.

    The recognition method, in my implementation, consists of a recognition part (this is a group of modified regular expressions) and a crucial part (written in GNU Prolog). Regular expressions are modified in such a way as to carry out some preprocessing (dictionary checking, fuzzy checking for Levenshteín similarity, checking for balance of expressions in parentheses) even at the parsing stage, for this they added the ability to include various chains (checking, adding, removing, learning neural network) "fast" predicates.

    This scheme also works. As a direct application, it was used to reconstruct simple training programs for working with vector data (search for minimum, maximum, arithmetic average), in which case it quite successfully demonstrated possible verification of generated programs by comparing the original and reconstructed models. But there are other uses, about them further.

    3. Natural-language interface for system generating programs


    When solving the inverse problem (described above), there are no restrictions on the type of raw materials for restoring the model of the problem being solved. It may well be a natural language text. You just need to write the appropriate recognition methods. Therefore, the implementation of a natural-language interface to the generation system can be an unexpected application of the inverse problem:

    a) the formulation of the problem is written in simplified natural language;
    b) the inverse problem is solved - a formal formulation is extracted from the natural language formulation (a network of object objects of the domain);
    c) the system is launched to produce a program according to the formal formulation obtained.

    And this scheme also works. The example is developed for the same case of simple training programs for working with vector data.

    Here is an example of a recognizing method (class “Entering a Vector or a Scalar from a Keyboard”), which is divided into two versions (recognition of a program (Programmatical mode) or recognition of a fragment of a problem statement in Russian (Russian mode)). From above there is a recognizing part, further decisive predicates on the GNU Prolog.

    @versions(Programmatical)
     @global_unique(PROCESS,infinity):-
      (\s*for\s*\(i\s*\=\s*0\;\s*i\s*\<\s*(\d+)->{N}\;\s*i\+\+\)\s*\{\\n\s*printf\(\"(\w+)->{VECTOR}
      \[\%i\]\s*\=\s*\"\,\s*i\)\;\\n\s*scanf\(\"\%lf\"\,\s*\&(\w+)==>{VECTOR}\[i\]\)\;\\n\s*\}\\n)|
      (\s*printf\(\"(\w+)->{SCALAR}\s*\=\s*\"\)\;\\n\s*scanf\(\"\%lf\"\,\s*\&(\w+)==>{SCALAR}\)\;\\n).
    @versions(Russian)
     @nearest(db_vvedem,2,"db_vvedem.csv").
     @fast(db_var,"db_var.csv").
     @global_unique(PROCESS,infinity):-
      (()->{VAR}([А-Яа-я]+)?=>{db_vvedem($)}\s+((с\s+клавиатуры\s+)?)->{KEYB}
       ([А-Яа-я]+)?=>{db_var($,$VAR)}\s+(\w+)->{ID}((\s\s+клавиатуры)?)!=>{KEYB}\s*\.).
    @versions(Russian)
     handle:-xpath('PROCESS','//ID/text()',[VText]),
      xpath('PROCESS','//VAR/text()',['0']),
      simple_vector(_,VText,_),
      global_id(GID),assertz(simple_act(GID,in,VText,'')),!.
     handle:-xpath('PROCESS','//ID/text()',[SText]),
      xpath('PROCESS','//VAR/text()',['1']),
      global_id(GID),assertz(simple_act(GID,in,SText,'')),!.
    @versions(Programmatical)
     handle:-xpath('PROCESS','//VECTOR/text()',[VText]),
      xpath('PROCESS','//N/text()',[NText]),
      simple_vector(_,VText,NText),
      global_id(GID),assertz(simple_act(GID,in,VText,'')),!.
     handle:-xpath('PROCESS','//SCALAR/text()',[SText]),
      simple_scalar(_,SText),
      global_id(GID),assertz(simple_act(GID,in,SText,'')),!.
    @versions(Programmatical,Russian)
     @goal:-
      handle,!.
     @done:-
      clear_db.
    

    An example of a problem statement in Russian:
    Составить программу. Ввести скаляр max. Ввести скаляр min.
    Введем вектор V из 10 элементов. Зададим вектор V с клавиатуры. Найдем минимум вектора V и поместим результат в скаляр min.
    Найдем также максимум вектора V и поместим результат в скаляр max. Вывести скаляр min на экран. Вывести скаляр max на экран.
    Вывести вектор V на экран.
    А среднее арифметическое считать не будем.
    

    And the task model obtained by the system from the above natural language description :


    4. Other applications of the inverse problem


    When solving the inverse problem, nothing prevents us from considering the case of not just recognizing a certain program, but recognizing its case “with improvement” or another (arbitrary character) processing. In particular, it was possible to develop an “automatic parallelizer” of a recognizable C program. It performs static and, in part, dynamic analysis of the recognized code and inserts parallelization directives from the Cilk / Cilk ++ extension into it. Such an improvement task is accomplished by discriminating methods (on GNU Prolog).

    An example of a computational C program processed by the parallelizer (it inserted the directives cilk_spawn and cilk_sync):

    #include<stdio.h>#include<math.h>#include<omp.h>#define PI 3.14159265358#define NX 100#define NY 100#define MaxT 0.001#define h 0.01#define tau 0.00001#define cilk_spawn _Cilk_spawn#define cilk_sync _Cilk_syncvoidF( double x, double y, double t, double * val ){
      double r = sqrt(x*x + y*y);
      double result = 0.0;
      int i;
      for ( i = 0; i < 300; i++ )
        result += exp(-r*t)*sin(0.1*i*r + 0.5*PI);
      *val = result;
    }
    intmain(  ){
      double f[NY][NX] = {0.0};
      double v[NY][NX] = {0.0};
      double v1[NY][NX] = {0.0};
      double t;
      int x, y;
      double start = omp_get_wtime();
      for ( t = 0.0; t < MaxT; t += tau )
        {
          for ( y = 1; y < NY-1; y++ )
            for ( x = 1; x < NX-1; x++ )
              cilk_spawn F(x*h, y*h, t, &f[y][x]);
          for ( y = 1; y < NY-1; y++ )
            for ( x = 1; x < NX-1; x++ )
              {
                cilk_sync;
                v1[y][x] = v[y][x] + tau*((v[y-1][x]+v[y+1][x]+v[y][x-1]+v[y][x+1]-4.0*v[y][x])/h/h - f[y][x]);
              }
          for ( y = 1; y < NY-1; y++ )
            for ( x = 1; x < NX-1; x++ )
              v[y][x] = v1[y][x];
        }
      for ( y = 0; y < NY; y++ )
        {
          for ( x = 0; x < NX; x++ )
            printf("%lf ", v[y][x]);
          printf("\n");
        }
      printf("Total time = %lf\n", omp_get_wtime() - start);
      return0;
    }
    

    5. A little fiction. Verification and identification of arbitrary closed source programs


    Here we are talking about arbitrary programs written by a programmer and / or a system of generating programs for which there is simply no source code. Suppose you want to check the correctness of the functioning of such a program, or even to even understand what it does. In this case, one or another version of the so-called “metasloy”, a hypothetical component of the operating system, which tracks the entire sequence of the program’s operation (function call, data modification in memory, etc.) and builds an approximate model of its logic form equivalent to the functioning of the program in any programming language. After that, it remains to solve the inverse problem for such a program - to restore a possible initial model (or models) that would allow at least to assess the correctness or understand the purpose of the program. The prototype of such a “metasloy” was once developed by the author for the case of C / C ++ programs, not so much was possible, but something worked. Maybe someday someone will want to do this work in full volume.

    6. Conclusion


    I hope that I was able to demonstrate that the automatic generation of programs and the solution of the inverse problem are not purely academic problems, but something useful and having direct practical consequences. At the same time, the solutions presented here do not claim to be complete and obvious, but they are implemented and pretend to a certain novelty.

    Also popular now: