"Topological" sorting of a graph with cycles
The full title of the article was supposed to be “Sustainable“ topological ”sorting of a graph with cycles
Disclaimer: I'm a programmer, not a mathematician, so inaccurate language is possible in places, for which you can and should kick.
I will analyze the wording of the problem, the solution of which I want to share, in parts.
Topological sorting is the ordering of the vertices of a directed acyclic graph in which each of the vertices from which the edge comes out comes earlier than the vertex into which this edge enters. There are two important nuances here: a graph can have more than one such orderings and it is applicable only to acyclic graphs. Mathematicians do not care, but programmers sometimes want determinism and a little more than "I'm sorry, you have a cycle here, you will not have any sorting."
Therefore, we add the stability requirement: a pair of vertices, the order between which is not specified by the edges of the graph, should be determined by the order in which these vertices arrived at the input of the algorithm. As a result, repeated sorts will not change the order of the vertices.
With the lack of recursion, everything is simple, the computer is significantly weaker than the Turing machine and memory (and especially the stack) has limited. Therefore, in applied programming, usually iterative algorithms are preferable to recursive ones.
And finally, I’ll define what I call “topological” sorting in the case of loopsin the graph. This is the ordering of the vertices, which coincides with the true topological sorting, if each of the cycles is replaced by one vertex, and the vertices of the cycle itself, in accordance with the stability requirement, are located relative to each other in the original order.
And nowwith all this garbage we’ll try to take off. I’ll do it all within the framework of the time and memory restrictions indicated at the beginning of the post.
If you look at the existing algorithms for topological sorting ( Kahn algorithm , deep search ), it turns out that they all, if there is a cycle, say "I can not" and stop working.
Therefore, let’s go on the other hand, with algorithms that can do something intelligible with cycles. For example, find them. Among the algorithms listed in Wikipedia for finding cycles in graphs , attention was drawn to the Taryan algorithm . It contains an interesting remark that, as a by-product, the algorithm produces the inverse topological sorting of the graph:
So, we have a list of vertices at the input and (thanks to Pierce) a certain index of the component of strong connectivity to which this vertex at the output belongs. The next step is to stably sort the vertices according to the serial number of their component. There is an algorithm for such a task, it is called counting sorting , which performs this in
In the process of gathering the algorithm into a heap, it turned out that the fact that it is natural to give it the opposite topological sorting is really much coming out of Taryan - then the neighboring branches of the graph (not having an order relationship between them) will number backwards, then the pieces of the graph will not having any connections between themselves, turn out to be in the reverse order ...
So the final solution:
GitHub code, Java, public domain . It may be noted that in order to ensure the stability of sorting, the Pierce algorithm is slightly modified and bypasses the vertices in the reverse order.
And now the background, why all this was needed. When loading / unloading dynamic libraries (.so), glibc needs to decide in which order to initialize the static variables. Variables depend on each other, depend on different functions, etc. In general, all this forms the very graph in which there are cycles and which must be sorted.
Once upon a time, rather non-optimal code that performed the task for was engaged in this task
The harsh men from RedHat thought, thought and screwed up a few more cycles from above. Problem cases were fixed, but the algorithm began to work for
And time passed, glibc mercilessly slowed down and in 2015 there was another attempt to repair the algorithm . Unfortunately, unsuccessful, the algorithm was chosen
Today is the year 2019, glibc is still slowing down. Judging by how much time it took me to fix the problem, the chances that I will bring it to the end are significantly lower than 100%. This is further aggravated by the fact that things are happening in C, without IDE support, in GNU coding style code, crazy test runner (“if you want to run the test again, just delete the corresponding .out file”). And in order for glibc to take a look at your patch, you need to go through the copyright assignment procedure , properly issue the patch and the devil knows what else. Therefore, in order to at least remove the issue of inventing an algorithm that solves the problem, this post was written.
O(|V| + |e| log |e|)
in time and O(|V|)
memory without recursion,” but they told me that it was too much. Disclaimer: I'm a programmer, not a mathematician, so inaccurate language is possible in places, for which you can and should kick.
Essence of the task
I will analyze the wording of the problem, the solution of which I want to share, in parts.
Topological sorting is the ordering of the vertices of a directed acyclic graph in which each of the vertices from which the edge comes out comes earlier than the vertex into which this edge enters. There are two important nuances here: a graph can have more than one such orderings and it is applicable only to acyclic graphs. Mathematicians do not care, but programmers sometimes want determinism and a little more than "I'm sorry, you have a cycle here, you will not have any sorting."
Therefore, we add the stability requirement: a pair of vertices, the order between which is not specified by the edges of the graph, should be determined by the order in which these vertices arrived at the input of the algorithm. As a result, repeated sorts will not change the order of the vertices.
With the lack of recursion, everything is simple, the computer is significantly weaker than the Turing machine and memory (and especially the stack) has limited. Therefore, in applied programming, usually iterative algorithms are preferable to recursive ones.
And finally, I’ll define what I call “topological” sorting in the case of loopsin the graph. This is the ordering of the vertices, which coincides with the true topological sorting, if each of the cycles is replaced by one vertex, and the vertices of the cycle itself, in accordance with the stability requirement, are located relative to each other in the original order.
And now
Search for a solution
If you look at the existing algorithms for topological sorting ( Kahn algorithm , deep search ), it turns out that they all, if there is a cycle, say "I can not" and stop working.
Therefore, let’s go on the other hand, with algorithms that can do something intelligible with cycles. For example, find them. Among the algorithms listed in Wikipedia for finding cycles in graphs , attention was drawn to the Taryan algorithm . It contains an interesting remark that, as a by-product, the algorithm produces the inverse topological sorting of the graph:
While there is nothing special about the order of the nodes within each strongly connected component, one useful property of the algorithm is that no strongly connected component will be identified before any of its successors. Therefore, the order in which the strongly connected components are identified constitutes a reverse topological sort of the DAG formed by the strongly connected components .True, the algorithm is recursive and it is not clear what it has with stability, but this is clearly a movement in the right direction. A closer reading of Wikipedia reveals a reference to the article A space-efficient algorithm for finding strongly connected components , authored by comrade David Pierce, in which not only is there an imperative algorithm, it also has reduced memory requirements compared to the classic Tarjan's algorithm. The bonus is the implementation of the algorithm in Java . Must take!
Algorithm PEA_FIND_SCC3 (V, E) from Pierce's article
So, we have a list of vertices at the input and (thanks to Pierce) a certain index of the component of strong connectivity to which this vertex at the output belongs. The next step is to stably sort the vertices according to the serial number of their component. There is an algorithm for such a task, it is called counting sorting , which performs this in
O(n)
time. In the process of gathering the algorithm into a heap, it turned out that the fact that it is natural to give it the opposite topological sorting is really much coming out of Taryan - then the neighboring branches of the graph (not having an order relationship between them) will number backwards, then the pieces of the graph will not having any connections between themselves, turn out to be in the reverse order ...
Answer
So the final solution:
- We number the vertices of the original list.
O(|V|)
- We sort the edges of each vertex according to the number of the vertex to which the edge goes.
O(|e| log |e|)
- Using the Pierce algorithm, we find and number the components of strong connection.
O(|V|)
- Using sorting by counting, we sort the vertices based on the numbers of the strongly connected components that they received.
O(|V|)
GitHub code, Java, public domain . It may be noted that in order to ensure the stability of sorting, the Pierce algorithm is slightly modified and bypasses the vertices in the reverse order.
But why???
And now the background, why all this was needed. When loading / unloading dynamic libraries (.so), glibc needs to decide in which order to initialize the static variables. Variables depend on each other, depend on different functions, etc. In general, all this forms the very graph in which there are cycles and which must be sorted.
Once upon a time, rather non-optimal code that performed the task for was engaged in this task
O(n^2)
. And in general, this didn’t really bother anyone, until in 2012 it was discovered that the code was not working correctly and in some cases was mistaken. The harsh men from RedHat thought, thought and screwed up a few more cycles from above. Problem cases were fixed, but the algorithm began to work for
O(n^3)
, and this has already become noticeable and on some applications it began to take a few tens of seconds, which was a bug in 2013 . Also, the author of the bug discovered cases in which the algorithm c was O(n^3)
also mistaken . He suggested using the Taryan algorithm, though the patch with corrections has never been designed. And time passed, glibc mercilessly slowed down and in 2015 there was another attempt to repair the algorithm . Unfortunately, unsuccessful, the algorithm was chosen
O(n^2)
, besides confusing the branches of the graph, between which the order is not defined.Today is the year 2019, glibc is still slowing down. Judging by how much time it took me to fix the problem, the chances that I will bring it to the end are significantly lower than 100%. This is further aggravated by the fact that things are happening in C, without IDE support, in GNU coding style code, crazy test runner (“if you want to run the test again, just delete the corresponding .out file”). And in order for glibc to take a look at your patch, you need to go through the copyright assignment procedure , properly issue the patch and the devil knows what else. Therefore, in order to at least remove the issue of inventing an algorithm that solves the problem, this post was written.