dkuzminov February 17, 2014 at 15:12

From Bubble Sort to Genetic Algorithms

From the sandbox

This article is a brief overview of what genetic algorithms are. Being a beginner in bioinformatics, I’ll start with things that are close and understandable to people in technical specialties: sorting algorithms, then I will go on to describe one of the problems of genetic programming and what biologists understand by algorithms.

Sortings

I think that every programmer implicitly believes that he knows about this subject, if not all, then almost everything. Maybe at the moment it is so, and the reader of these lines has studied the third volume of “The Art of Programming” by Donald Knuth. But let's do an experiment: take your favorite algorithm and try to implement it with a pencil on a piece of paper. If there is no favorite algorithm, then I suggest sorting by bubble. Happened? I have repeatedly been offered this task at interviews, but few of the proposed options turned out to be workers, and only a handful turned out to be a “bubble”. Recently I happened to see a lecture notes on Python, where the teacher, under the guise of sorting a bubble, gives the following code:

def bubble_sort(numbers):
    for first in range(0, len(numbers)-1):
        for second in range(first+1, len(numbers)):
            if numbers[first] > numbers[second]:
                numbers[first], numbers[second] = numbers[second], numbers[first]

The algorithm is working, it will be sorted, but only this is clearly not a “bubble”. With a slight stretch, it can be called sorting by selection, but there is an unfortunate oversight here: the advantage of sorting by selection is to minimize the number of replacements, that is, the number of times when we swap elements should be limited to O (n), which is not observed in the code above.

In a conversation about algorithms, it can be interesting to turn the question of quick sorting over. As a rule, people easily answer the question of why it is good: this is a good algorithmic complexity on average, and the lack of need for additional memory (they usually forget about the stack). But then it is interesting to ask, why is it bad, since, besides it, there are many other algorithms? The answers to this question are much more scarce and monotonous. And what are the disadvantages of Quick Sort you can name?

To my pleasure, sometimes it is possible to meet aesthetes demonstrating knowledge of exotic methods, for example, gnome sort.
Conclusion:
1. If you want to pass for an esthete at the interview, study a couple of little-known sorting methods;
2. If you yourself accept interviews, be able to distinguish different methods from each other, otherwise you never know the candidate, the following paragraph 1 ...

Pancake Sort

Following the tips above, we consider a task that does not have practical meaning, but is interesting from a theoretical point of view: pancake sorting. Its impracticality lies in the fact that in this algorithm the number of comparisons of elements is not taken into account at all (it is believed that these operations are very cheap and fast), and the only operation that has a price is to flip the top of the stack of sorted “pancakes”. Of course, initially they are arranged in random order, and the desired result is a kind of tower of Hanoi, when pancakes with a larger diameter lie below, and smaller pancakes are located on top.

Regarding this problem, I would like to note two interesting facts. First, the method of finding a sufficiently effective (albeit non-optimal) solution was proposed at one time by the notorious Bill Gates, while he was still a student. This algorithm, proposed in 1979, remained the most efficient until 2008, when the result was surpassed. Secondly, as was proved later, the problem of finding the optimal solution belongs to the class of NP-complete ones. Gates and his leader Christos Papadimitriou also proposed a complicated version of the problem, known as the burned pancakes problem.

The problem of burnt pancakes

In this form of the problem, each element, in addition to size, also has a binary attribute “direction”, that is, for each “pancake” one side is burnt and the other is rosy; the result of solving the problem is a stack ordered by size, but, in addition, all the “pancakes” are burnt side down. Without going into details, we say that this problem is NP-complete, and that in the general case its sufficiently effective but non-optimal solutions are known. However, there are restrictions on the data, upon completion of which the task becomes polynomial. For this problem, such data are simple permutations . To understand what it is, consider a permutation of numbers from 1 to 7: 2 647513. Note that the sequence in bold is itself a permutation of numbers from 4 to 7 (called a block ). Simple permutations are those where there are no nontrivial blocks (blocks of length 1 and n).

Behind the figurative formulation of the problem, when the elements are represented by burnt pancakes, it is difficult to discern the fact that it has biological significance. However, a common mutation is a flip in a DNA molecule, and if you calculate the minimum number of flips needed to convert one molecule to another (without considering smaller point mutations), then you can roughly estimate the affinity of organisms. For example, the human and mouse genome is separated by only about 120 global mutations; I admit, I used to believe that there is much more difference between these types of ...

Genetic algorithm for solving the problem of burnt pancakes

In comparative genomics, the algorithm is used to analytically find the number of mutations that separate organisms, but let's look at the problem in a different projection. Now we will declare the solution of the analytical problem as the goal, and we will make living organisms and processes in them an instrument for its solution.

As you know, bacteria are able to divide, providing exponential population growth, if they are provided with the necessary conditions; Of course, after a while the bacteria colonies will no longer have a nutrient medium, other factors will also appear that influence the growth of the colony, but this takes time. Experimentally, we can find out the average time it takes for the bacteria of this species to appear on a global mutation associated with the inversion of a part of DNA.

Now let's set the task for bacteria. We genetically modify one of them (biologists like to use Escherichia coli in such experiments), where we break the antibiotic resistance gene into several parts and mix them together, changing not only the order, but also the direction of some pieces. Thus, each inverted and rearranged piece of the gene in our case will be a “burnt pancake”.

The task of the bacterium is posed, you can start the experiment. We place it in a nutrient medium and wait for the allotted time for which the appearance of one DNA mutation-revolution is expected. I draw attention to what we consider one mutation: it is not unique to the entire colony (there will be many of these mutations in the colony); this is just the expected number of upheavals in each of the living bacteria compared to their progenitor.

Is one pancake flip enough to solve a problem? We can easily verify this by placing part of the colony in a hazardous environment. The bacteria placed in the antibiotic did not survive, and we continue to monitor the remaining ones. Two or three more periods will pass, and finally, in the group of bacteria placed in the antibiotic, they will remain alive. The "collective mind" coped with the task!

Summary

Of course, the usefulness of the solved problem is unlikely to impress us: in the actual experiments conducted, the number of sorted “pancakes” does not exceed four, and the number of mutations occurring in the allotted time can only be estimated probabilistically. And yet, personally I was struck by the imagination of those biologists who were able to set up an experiment; those who were able to solve the traveling salesman problem by the biological method also possessed no less imagination (I will leave the details of this experiment outside the scope of this article). In many ways, the complexity of the tasks solved by genetic algorithms is comparable to quantum computing, and I want to believe that both areas of non-classical computing can give results that are not yet achievable in modern conditions.

Tags: