Symbolic regression
When solving problems with machine learning methods, we usually choose the algorithm best suited to the task, as well as a way to tune its parameters.
Let's look at a slightly different approach: instead of independently selecting an algorithm, we will develop a program that is able to automatically generate algorithms for solving problems.
This stochastic method of constructing programs by means of a genetic algorithm is called genetic programming.
A few words about the genetic algorithm
The genetic algorithm is an optimization method based on the idea of natural selection as a means of achieving the best result. Its general formulation makes it possible to apply the algorithm to a wide variety of problems. All you need is to:
- devise a way to encode candidate solutions as chromosomes that can mutate and crossbreed
- define a fitness function that calculates how good the solution encoded by a particular chromosome is
- then, given an initial population of chromosomes and the fitness function, the genetic algorithm will find chromosomes that correspond to (near-)optimal solutions to the problem; a minimal sketch of this loop is given right after this list
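To make the loop concrete, here is a minimal sketch in Java, assuming nothing about the specific problem; the Operators interface and all names are illustrative, not taken from the article's actual code:

```java
import java.util.*;

// Minimal sketch of a genetic algorithm loop; lower fitness means a better
// chromosome. All names are illustrative.
class GeneticAlgorithm {

    interface Operators<C> {
        C randomChromosome();          // used to build the initial population
        C mutate(C c);                 // random modification of a chromosome
        C crossover(C a, C b);         // combination of two chromosomes
        double fitness(C c);           // how good the encoded solution is
    }

    static <C> C evolve(Operators<C> ops, int populationSize, int iterations) {
        Random rnd = new Random();
        List<C> population = new ArrayList<>();
        for (int i = 0; i < populationSize; i++) {
            population.add(ops.randomChromosome());
        }
        for (int iter = 0; iter < iterations; iter++) {
            // produce offspring by crossover and mutation
            List<C> next = new ArrayList<>(population);
            for (int i = 0; i < populationSize; i++) {
                C a = population.get(rnd.nextInt(population.size()));
                C b = population.get(rnd.nextInt(population.size()));
                next.add(ops.mutate(ops.crossover(a, b)));
            }
            // "natural selection": keep only the fittest chromosomes
            next.sort(Comparator.comparingDouble(ops::fitness));
            population = new ArrayList<>(next.subList(0, populationSize));
        }
        return population.get(0); // the best chromosome found
    }
}
```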
Of course, the genetic algorithm has certain drawbacks associated with its stochastic nature, but based on my own experience, I can conclude that it finds acceptable solutions within reasonable time and resource constraints.
Genetic programming and symbolic regression
I propose to examine the concept of genetic programming in more detail using the example of symbolic regression: the task of finding a formula that describes a certain dependence. More formally, it can be stated as the construction of a regression model in the form of a superposition of given functions.
A fairly convenient way to represent a program is a syntax tree. In this case, the role of “programs” is played by algebraic formulas. Leaf nodes correspond to variables or numeric constants, while non-leaf nodes contain an operation that is performed on the child nodes.
Example syntax tree:

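In Java (the language of the accompanying repository), such a tree can be sketched roughly as follows. This is a simplified illustration with only ADD, SUB and MUL; the actual classes in the repository may look different:

```java
import java.util.*;

// A simplified syntax-tree node (the actual classes in the repository may
// differ). A leaf holds a variable name or a numeric constant; a non-leaf
// node holds an operation applied to its child nodes.
class Expression {
    String operation;                 // "ADD", "SUB", "MUL", "VARIABLE" or "CONSTANT"
    String variable;                  // set when operation is "VARIABLE"
    double constant;                  // set when operation is "CONSTANT"
    List<Expression> children = new ArrayList<>();

    double eval(Map<String, Double> context) {
        switch (operation) {
            case "CONSTANT": return constant;
            case "VARIABLE": return context.get(variable);
            case "ADD": return children.get(0).eval(context) + children.get(1).eval(context);
            case "SUB": return children.get(0).eval(context) - children.get(1).eval(context);
            case "MUL": return children.get(0).eval(context) * children.get(1).eval(context);
            default: throw new IllegalStateException("unknown operation: " + operation);
        }
    }
}
```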
It is worth noting that for each syntax tree there is an infinite number of semantically equivalent trees, for example:

We will use syntax trees as chromosomes.
Genetic operations on syntax trees
Crossover can be implemented by exchanging randomly selected subtrees:

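A rough sketch of this operation over the Expression class from above: deep-copy both parents, pick a random node in each copy, and swap the contents of the two nodes, which swaps the subtrees rooted at them (again, an illustration rather than the repository's actual code):

```java
import java.util.*;

// Sketch of crossover: deep-copy both parents, pick a random node in each
// copy, and swap the contents of the two chosen nodes.
class Crossover {
    static final Random RND = new Random();

    static Expression[] crossover(Expression parent1, Expression parent2) {
        Expression child1 = copy(parent1);
        Expression child2 = copy(parent2);
        List<Expression> nodes1 = allNodes(child1);
        List<Expression> nodes2 = allNodes(child2);
        swapContents(nodes1.get(RND.nextInt(nodes1.size())),
                     nodes2.get(RND.nextInt(nodes2.size())));
        return new Expression[] {child1, child2};
    }

    static Expression copy(Expression e) {
        Expression c = new Expression();
        c.operation = e.operation;
        c.variable = e.variable;
        c.constant = e.constant;
        for (Expression child : e.children) {
            c.children.add(copy(child));
        }
        return c;
    }

    static List<Expression> allNodes(Expression root) {
        List<Expression> nodes = new ArrayList<>();
        nodes.add(root);
        for (Expression child : root.children) {
            nodes.addAll(allNodes(child));
        }
        return nodes;
    }

    // swapping the fields of two nodes swaps the subtrees rooted at them
    static void swapContents(Expression a, Expression b) {
        String op = a.operation;  a.operation = b.operation;  b.operation = op;
        String var = a.variable;  a.variable = b.variable;    b.variable = var;
        double con = a.constant;  a.constant = b.constant;    b.constant = con;
        List<Expression> ch = a.children; a.children = b.children; b.children = ch;
    }
}
```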
Many different mutations can be devised, but in the end, in my implementation, I settled on the following (the second one is sketched in code right after this list):
- Replacing a function with another, randomly selected one:
- Replacing a subtree with a randomly generated subtree:
- Removing an intermediate non-leaf node:
- Creating a random node that becomes the new root of the tree:
- For nodes with non-commutative operations, swapping the child subtrees:
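A sketch of the second mutation (replacing a subtree with a randomly generated one). The function set, the leaf probability and the constant distribution are arbitrary choices made for this illustration; randomExpression can also serve as the generator for the initial population:

```java
import java.util.*;

// Sketch of one mutation: replacing a randomly chosen subtree with a
// freshly generated random one. Reuses the Crossover helpers from above.
class Mutation {
    static final Random RND = new Random();
    static final String[] FUNCTIONS = {"ADD", "SUB", "MUL"};
    static final String[] VARIABLES = {"x", "y", "z"};

    static void replaceRandomSubtree(Expression root, int maxDepth) {
        List<Expression> nodes = Crossover.allNodes(root);
        Expression target = nodes.get(RND.nextInt(nodes.size()));
        Expression replacement = randomExpression(maxDepth);
        Crossover.swapContents(target, replacement); // overwrite the chosen node
    }

    // also usable for generating the trees of the initial population
    static Expression randomExpression(int maxDepth) {
        Expression e = new Expression();
        if (maxDepth <= 1 || RND.nextBoolean()) {
            // make a leaf: a variable or a numeric constant
            if (RND.nextBoolean()) {
                e.operation = "VARIABLE";
                e.variable = VARIABLES[RND.nextInt(VARIABLES.length)];
            } else {
                e.operation = "CONSTANT";
                e.constant = RND.nextGaussian() * 10;
            }
        } else {
            // make a function node with two random child subtrees
            e.operation = FUNCTIONS[RND.nextInt(FUNCTIONS.length)];
            e.children.add(randomExpression(maxDepth - 1));
            e.children.add(randomExpression(maxDepth - 1));
        }
        return e;
    }
}
```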
The fitness function of a syntax tree calculates how well the corresponding algebraic formula approximates the data; in fact, it is the sum of squared deviations of the generated function's values from the training data (a sketch follows below).
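A minimal sketch of this computation; TrainingEntry is a hypothetical holder for one row of the training data:

```java
import java.util.*;

// Sketch of the fitness function: the sum of squared deviations between
// the tree's output and the training targets.
class TrainingEntry {
    Map<String, Double> variables;    // e.g. {x=26, y=35, z=1}
    double target;                    // e.g. 830
}

class Fitness {
    static double fitness(Expression tree, List<TrainingEntry> data) {
        double sum = 0;
        for (TrainingEntry entry : data) {
            double diff = tree.eval(entry.variables) - entry.target;
            sum += diff * diff;       // squared deviation; lower is better
        }
        return sum;
    }
}
```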
The initial population is a collection of randomly generated syntax trees.
Optimizations
In general, the crossover and mutation operations described above should be sufficient to find a solution. But, judging from my experience, in that case it takes a long time before a more or less acceptable solution is found. So I will describe several optimizations that speed up the search.
- Let's look at two formulas generated during the search for a solution (the red crosses are the training data):
It is easy to see that the blue line is quite close to the training data, unlike the green parabola, which means it has a smaller sum of squared deviations relative to the training sample.
But, on the other hand, it is obvious to the naked eye that a parabola is exactly what we need.
It's all about the coefficients. As you may have already guessed, the coefficients of each tree are optimized. And to optimize the coefficients, of course, a genetic algorithm is used!
In this case, the chromosome is an array of numbers:
- Since we work only with real numbers, it is convenient to use functions whose range of values is a subset of their domain, for example:
+, -, sin(x), cos(x), ln(abs(x) + 0.00001), sqrt(abs(x)), etc.
Thus, genetic modifications are guaranteed to produce syntax trees that evaluate without errors.
- There are cases in which a subtree contains only numeric constants in its leaf nodes. The value of such a subtree can be computed once, simplifying the syntax tree (a sketch of this follows after the list):
- Another optimization is limiting the maximum depth of the tree (of course, this constrains the solution search space, but the need for it is dictated by practical considerations):
- To avoid the situation where all the trees in the population become similar to one another (i.e., the genetic algorithm gets stuck in a local minimum), there is a non-zero probability of completely replacing one of the trees with a randomly generated one.
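As an example, here is a sketch of the constant-folding simplification over the Expression class from the earlier sketches: any subtree whose leaves are all numeric constants is evaluated once and collapsed into a single constant node.

```java
import java.util.*;

// Sketch of constant folding: a subtree without variables does not depend
// on the input, so it can be replaced by a single constant node.
class Simplifier {

    static void foldConstants(Expression node) {
        // simplify the children first (bottom-up traversal)
        for (Expression child : node.children) {
            foldConstants(child);
        }
        if (node.children.isEmpty()) {
            return; // a leaf: a variable or a constant, nothing to fold
        }
        for (Expression child : node.children) {
            if (!"CONSTANT".equals(child.operation)) {
                return; // depends on a variable somewhere below
            }
        }
        // all children are constants: evaluate this node and collapse it
        node.constant = node.eval(Collections.emptyMap());
        node.operation = "CONSTANT";
        node.variable = null;
        node.children = new ArrayList<>();
    }
}
```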
Enough theory, let's see how it works in practice
Let's try to find a formula that satisfies the following data:
x | y | z | f (x, y, z) |
---|---|---|---|
26 | 35 | 1 | 830 |
8 | 24 | -11 | 130 |
20 | 1 | 10 | 477 |
33 | 11 | 2 | 1217 |
37 | 16 | 7 | 1524 |
The desired formula:

f(x, y, z) = x^2 + 3 * x + 2 * y + z + 5
At iteration 111, we obtain the formula:

f(x, y, z) = ((((x * (3.0172293562767245 + x)) + y) - ((y * (-0.9842629202499209)) - 4.9120614148353544)) + z)

which, after transformations, turns into a formula that almost coincides with the desired one:

f(x, y, z) ≈ x^2 + 3.017 * x + 1.984 * y + z + 4.912
The dynamics of the formation of the solution are shown below:

The entire log of calculation results:
Population No. | Sum of squared deviations | Best formula |
---|---|---|
0 | 1669012.26766007 | f (x, y, z) = 544.6859000804278 |
1 | 169681.53834698 | f (x, y, z) = (x * (28.961135980737247 + z)) |
2 | 148288.46850717001 | f (x, y, z) = (x * (31.37500136808971 + z)) |
3 | 132767.659539771 | f (x, y, z) = (x * (52.43484580560726 + (-16.667350145606648))) |
4 | 2985.09234959275 | f (x, y, z) = (x * (4.433746792798393 + x)) |
5 | 2983.6495099344102 | f (x, y, z) = (x * (4.4641060233894825 + x)) |
6 | 2983.6495099344102 | |
7 | 2983.3906931086399 | f (x, y, z) = (x * (4.454265913806434 + x)) |
8 | 2983.3906931086399 | |
9 | 2983.3904005863701 | f (x, y, z) = (x * (4.454298469272653 + x)) |
10 | 597.81897366597798 | f (x, y, z) = ((x * (3.844350383063788 + x)) + y) |
11 | 594.94169424631298 | f (x, y, z) = ((x * (3.8969889609817003 + x)) + y) |
12 | 594.94169424631298 | |
13 | 593.73658852479002 | f (x, y, z) = ((x * (3.882218261498 + x)) + y) |
14 | 465.83708126862598 | f (x, y, z) = (((x * (4.005216664324722 + x)) + y) - z) |
15 | 465.83708126862598 | |
16 | 465.83708126862598 | |
17 | 465.83708126862598 | |
18 | 465.83015734565402 | f (x, y, z) = (((x * (4.005911869833458 + x)) + y) - z) |
19 | 465.83015734565402 | |
20 | 465.83015734565402 | |
21 | 415.16018053718898 | f (x, y, z) = (((x * (3.5125714437530116 + x)) + y) - (-11.692166173605955)) |
22 | 415.16018053718898 | |
23 | 395.52399773897002 | f (x, y, z) = (((x * (3.382514048684854 + x)) + y) - (-14.647402701410202)) |
24 | 395.07066142434297 | f (x, y, z) = (((x * (3.3745415764367164 + x)) + y) - (-14.718709944856545)) |
25 | 394.26327346841998 | f (x, y, z) = (((x * (3.3648511983617673 + x)) + y) - (-15.239602399729787)) |
26 | 392.88638158772301 | f (x, y, z) = (((x * (3.337209019532033 + x)) + y) - (-15.802043204356028)) |
27 | 392.32411284414297 | f (x, y, z) = (((x * (3.3064294470317237 + x)) + y) - (-16.587523294477112)) |
28 | 392.32411284414297 | |
29 | 392.32411284414297 | |
30 | 201.31809082052899 | f (x, y, z) = ((((x * (3.1436791342095485 + x)) + y) - (-3.592869973372564)) + y) |
31 | 164.61977693577799 | f (x, y, z) = ((((x * (3.284782293612155 + x)) + y) - 0.2814995918777923) + y) |
32 | 149.279581721206 | f (x, y, z) = ((((x * (3.428687042834285 + x)) + y) - 3.8492278595932117) + y) |
33 | 149.278346415192 | f (x, y, z) = ((((x * (3.428687042834285 + x)) + y) - 3.8397689886934776) + y) |
34 | 149.278346415192 | |
35 | 148.94429071530399 | f (x, y, z) = ((((x * (3.4733961096640367 + x)) + y) - 4.871582132520638) + y) |
36 | 148.94429071530399 | |
37 | 148.92149057538799 | f (x, y, z) = ((((x * (3.4733961096640367 + x)) + y) - 4.927910435311452) + y) |
38 | 148.842647569717 | f (x, y, z) = ((((x * (3.4667603760518504 + x)) + y) - 4.761715096087028) + y) |
39 | 148.842126194348 | f (x, y, z) = ((((x * (3.4667603760518504 + x)) + y) - 4.766164660321495) + y) |
40 | 148.83482158482201 | f (x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.761715096087028) + y) |
41 | 148.83482158482201 | |
42 | 148.83482158482201 | |
43 | 148.83482158482201 | |
44 | 148.83482158482201 | |
45 | 148.824714357071 | f (x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.723957086860974) + y) |
46 | 148.824474980844 | f (x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.719858152218609) + y) |
47 | 148.824474980844 | |
48 | 148.824474980844 | |
49 | 148.824474980844 | |
50 | 148.82441313535 | f (x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.717481429398491) + y) |
51 | 148.82441154759999 | f (x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.717364268937347) + y) |
52 | 148.82441154759999 | |
53 | 148.82441154759999 | |
54 | 148.82441154759999 | |
55 | 148.82441154759999 | |
56 | 148.82441154759999 | |
57 | 148.82441154759999 | |
58 | 148.824403242995 | f (x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.715925249764239) + y) |
59 | 148.824403242995 | |
60 | 148.824403242995 | |
61 | 148.824403242995 | |
62 | 148.824403242995 | |
63 | 148.824403242995 | |
64 | 148.824403242995 | |
65 | 148.824403242995 | |
66 | 148.824403242995 | |
67 | 148.824403242995 | |
68 | 148.824403242995 | |
69 | 148.824403242995 | |
70 | 148.824403242995 | |
71 | 148.824403242995 | |
72 | 148.824403242995 | |
73 | 148.824403242995 | |
74 | 148.824403242995 | |
75 | 148.824403242995 | |
76 | 148.824403242995 | |
77 | 148.824403242995 | |
78 | 148.824403242995 | |
79 | 148.824403242995 | |
80 | 148.824403242995 | |
81 | 148.824403242995 | |
82 | 148.824403242995 | |
83 | 148.824403242995 | |
84 | 148.824403242995 | |
85 | 148.824403242995 | |
86 | 148.824403242995 | |
87 | 148.824403242995 | |
88 | 148.824403242995 | |
89 | 148.824403242995 | |
90 | 148.824403242995 | |
91 | 148.824403242995 | |
92 | 148.824403242995 | |
93 | 148.824403242995 | |
94 | 148.824403242995 | |
95 | 148.824403242995 | |
96 | 148.824403242995 | |
97 | 148.824403242995 | |
98 | 148.824403242995 | |
99 | 148.824403242995 | |
100 | 109.738448314378 | f (x, y, z) = ((((x * (3.6304822654527666 + x)) + y) - ((y * 0.26890083188968594) - (-4.125304601214528))) + y) |
101 | 109.738448314378 | |
102 | 86.7213316804214 | f (x, y, z) = ((((x * (3.454146360987013 + x)) + y) - ((y * 0.26890083188968594) - 0.31706243101113074)) + y) |
103 | 86.077603952275396 | f (x, y, z) = ((((x * (3.4532441079663054 + x)) + y) - ((y * 0.2821822285682245) - 0.5330637639727107)) + y) |
104 | 84.787024859776594 | f (x, y, z) = ((((x * (3.418583927986026 + x)) + y) - ((y * 0.30109799837668216) - 1.6913597460963947)) + y) |
105 | 84.787024859776594 | |
106 | 19.436528477129301 | f (x, y, z) = ((((x * (3.1298089197766688 + x)) + y) - ((y * (-1.1430183282239135)) - (-0.7601011336523038))) + z) |
107 | 9.2899337994931397 | f (x, y, z) = ((((x * (3.1298089197766688 + x)) + y) - ((y * (-0.9847117571207198)) - 2.01456442176615)) + z) |
108 | 9.2899337994931397 | |
109 | 8.5880221046539802 | f (x, y, z) = ((((x * (3.1002105039045555 + x)) + y) - ((y * (-0.9847117571207198)) - 2.01456442176615)) + z) |
110 | 8.5880221046539802 | |
111 | 0.38510898634568402 | f (x, y, z) = ((((x * (3.0172293562767245 + x)) + y) - ((y * (-0.9842629202499209)) - 4.9120614148353544)) + z) |
It is interesting to follow the steps of forming a solution:
- first, a dependence on the variable x appears (this step is not shown in the figure, since the error is still very large: approximately 130,000)
- then the formula transforms into a quadratic dependence on x (the sum of squared deviations is ~3000)
- after that, the dependence on the variable y is identified (the sum of squared deviations is ~500)
- and finally, after optimizing the numerical parameters and identifying the dependence on the variable z, the optimal result is achieved
One more example
Sometimes you get formulas that are not so obvious. For example, once, when trying to get a formula,

I got the following result:

But if you use the identity:

You can show that the resulting formula is correct:

If you want to experiment
Console Utility
- You must have the Java runtime installed
- Download the file symbolic_regression_1.0.jar from here
- In the same directory as symbolic_regression_1.0.jar, create a file (for example, data.txt) with the following contents:
```
# allowed functions are: ADD SUB MUL DIV SQRT POW LN SIN COS
# set which functions to use:
ADD MUL SUB
# looking for:
f(x, y, z) — ?
# define training set:
f(26, 35, 1) = 830
f(8, 24, -11) = 130
f(20, 1, 10) = 477
f(33, 11, 2) = 1217
f(37, 16, 7) = 1524
```
- Start the search for a solution:
```
java -jar symbolic_regression_1.0.jar data.txt
```
A solution is considered found if the sum of the squares of the deviations is less than 10. If a solution has not yet been found, then after every 50 iterations you will be asked if you want to continue.
Java programmers
You can find the source code here: github.com/lagodiuk/genetic-programming
Instead of a conclusion
On the fairly limited set of formulas that I managed to test, good results were obtained; there were also successful experiments in finding antiderivatives and derivatives.
Genetic programming can also be used quite successfully for the automatic construction of neural networks (here is a small demo of an application in which I implemented this approach).
There is also an idea to apply a similar approach (together with QSAR techniques) to the automatic construction of structures of chemical compounds with desired properties.
Literature
On the Web you can find a lot of materials on the subject of genetic programming.
Separately, I want to note the following sources: