Breadth-first search optimization: how to process a graph with 10 billion states

Original author: Juho Snellman

A couple of months ago I finally had to admit that I was not smart enough to get through some of the levels of the puzzle game Snakebird. The only way to regain some self-esteem was to write a solver; then I could pretend that writing a program to solve the puzzle is almost the same as solving it myself. The code for the resulting C++ program is available on Github. The main parts of the code discussed in this article are implemented in search.h and compress.h. In this post I will mainly talk about optimizing a breadth-first search that would otherwise require 50-100 GB of memory so that it fits in 4 GB.

Later I will write another post about the specifics of the game itself. For this post, all you need to know is that I could not find any good alternative to brute force, because none of the usual tricks worked. The game has a huge number of states, since there are many movable or pushable objects, and the shape of some of them matters and can change over time. There was no suitable admissible heuristic for algorithms like A* to narrow the search space. The search graph was directed and implicitly defined, so searching simultaneously forward and backward was not possible. And a single move could change the state in many unrelated ways, so nothing like Zobrist hashing could help.

Rough estimates suggested that the biggest puzzle, after eliminating all symmetric positions, would have about 10 billion states. Even after packing the state descriptions as densely as possible, a state was 8-10 bytes. With 100 GB of memory the problem would be trivial, but not on my home machine with 16 GB. And since Chrome needs 12 GB of that, my real memory budget was closer to 4 GB. Anything beyond that would have to go to disk (an old and rusty hard drive).

How do you fit 100 GB of data into 4 GB of RAM? Either a) the states have to be compressed to 1/20 of their original, already optimized size, or b) the algorithm has to be able to efficiently spill states to disk and read them back, or c) some combination of the two, or d) I need to buy more RAM or rent a big virtual machine for a few days. Option D was out, because it is too boring. Option A was ruled out by a proof of concept with gzip: a 50 MB chunk of state descriptions compressed to only 35 MB. That is about 7 bytes per state, while my memory budget is about 0.4 bytes per state. So option B remained, even though breadth-first search looked rather awkward to run out of secondary storage.

Contents


This is a fairly long post, so here is a brief overview of the following sections:

  • Textbook breadth-first search - what the usual formulation of breadth-first search (BFS) looks like, and why it is unsuitable when parts of the state set have to live on disk.
  • BFS with sorting and merging - changing the algorithm so that redundant data can be discarded efficiently in batches.
  • Compression - reducing memory use by a factor of a hundred by combining standard and custom compression.
  • Uh-oh, I cheated! - in the first sections I kept quiet about something: it is not enough to know that a solution exists, we need to know exactly how to reach it. In this section the basic algorithm is updated so that it carries enough data to reconstruct the solution from the final state.
  • Sort + merge with multiple outputs - storing more data per state negates the benefits of compression. The sort + merge algorithm is changed so that it keeps two output sets: one, well compressed, is used during the search, and the other is used only to reconstruct the solution once it has been found.
  • Swap - swap on Linux is much worse than I thought.
  • Compressing new states before merging - so far the memory optimizations applied only to the set of visited states. But it turns out that the list of newly generated states is much larger than you might expect. This section shows a scheme for representing new states more efficiently.
  • Saving space on parent states - exploring the CPU/memory trade-offs involved in reconstructing the solution at the end.
  • What did not work or may not work - some ideas looked promising but ended up being rolled back, while others that appear in the research literature intuitively seem unsuitable for this case.

Textbook breadth-first search


What does breadth-first search look like, and why does it not play well with a disk? Before this little project, I had only ever seen textbook formulations, something like this:

def bfs(graph, start, end):
    visited = {start}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return True
        for kid in adjacent(node):
            if kid not in visited:
                visited.add(kid)
                todo.push_back(kid)
    return False

As the program generates new candidate nodes, each node is checked against a hash table of already visited nodes. If it is already in the hash table, the node is ignored. Otherwise it is added both to the queue and to the hash table. Sometimes implementations store the "visited" flag inside the nodes themselves rather than in a separate table; but that is a risky optimization, and it is outright impossible when the graph is only defined implicitly.

Why is the hash table a problem? Because hash tables tend to produce a completely random memory access pattern. If they do not, the hash function is bad and the table will probably perform poorly due to collisions. This random access pattern can cause performance problems even when the data fits in memory: a lookup in a huge hash table is likely to cause a cache miss and a TLB (translation lookaside buffer) miss. But what if a significant portion of the data lives on disk rather than in memory? Then the consequences are catastrophic: something on the order of 10 ms per lookup.

With 10 billion unique states, the hash table lookups alone would mean about four months of waiting for disk I/O. That will not do; the problem absolutely has to be transformed so that the program can process data in large batches.

BFS with sorting and merging


If we want to batch data accesses as aggressively as possible, what is the coarsest batching we can achieve? Since the program cannot know which nodes to process at depth N+1 until layer N has been fully processed, it seems clear that we have to deduplicate states at least once per depth layer.

If we always work with a whole layer at a time, we can drop the hash table and represent the set of visited states and the set of new states as sorted streams of some kind (file streams, arrays, lists, and so on). The new visited set is then trivially found as the union of the two streams, and the new todo set is just as trivially found as their set difference.

The two set operations can be combined into a single pass over both streams. Essentially, we peek at both streams, process the smaller element, and then advance the stream the element came from (or both streams, if their heads are equal). In either case we add the element to the new visited set. If the element came only from the stream of new states, we also add it to the new todo set:

def bfs(graph, start, end):
    visited = Stream()
    todo = Stream()
    visited.add(start)
    todo.add(start)
    while True:
        new = []
        for node in todo:
            if node == end:
                return True
            for kid in adjacent(node):
                new.push_back(kid)
        new_stream = Stream()
        for node in new.sorted().uniq():
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited)
    return False

# Merges the sorted streams new and visited. Returns a sorted stream
# of elements that were only present in new, and another sorted
# stream containing the elements that were present in either or
# both of new and visited.
def merge_sorted_streams(new, visited):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            if visited.peek() == new.peek():
                out_visited.add(visited.pop())
                new.pop()
            elif visited.peek() < new.peek():
                out_visited.add(visited.pop())
            elif visited.peek() > new.peek():
                out_todo.add(new.peek())
                out_visited.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek())
            out_visited.add(new.pop())
    return out_todo, out_visited

The data access pattern is now completely linear and predictable: there are no random accesses at all during the merge. Disk latency therefore stops mattering, and the only thing that remains important is throughput.

What does the theoretical performance look like with a simplified model of 100 depth layers, each holding 100 million states? The average state would be read and written about 50 times. That gives 10 bytes/state * 5 billion states * 50 = 2.5 TB. My hard drive can supposedly read and write at an average of 100 MB/s, which means the I/O would take (2 * 2.5 TB) / (100 MB/s) ≈ 50,000 seconds ≈ 13 hours. That is a couple of orders of magnitude better than the earlier estimate of four months!
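
As a quick sanity check of that arithmetic, here is the same estimate as a few lines of Python (the quantities are the assumptions from the simplified model above, not measurements):

# Back-of-the-envelope check of the I/O estimate above.
state_bytes = 10
states = 5_000_000_000            # states in the simplified model
passes = 50                       # average times a state is read and written
disk_bandwidth = 100 * 10**6      # ~100 MB/s sequential throughput

total_bytes = state_bytes * states * passes   # 2.5e12 bytes = 2.5 TB
seconds = 2 * total_bytes / disk_bandwidth    # reads + writes
print(seconds / 3600)                         # ~13.9 hours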

It is also worth noting that this simplified model does not account for the newly generated states: before the merge step they have to be held in memory for sorting and deduplication. We will come back to that in the sections below.

Compression


In the introduction, I said that in the early experiments state compression did not look promising: it saved only about 30%. But after the changes to the algorithm, the states are now stored in sorted order, and sorted data should be much easier to compress.

To test this theory, I used zstd on a puzzle with 14.6 million states, each 8 bytes in size. After sorting, they compressed to an average of 1.4 bytes per state. That looks like a serious step forward. It is not enough to run the whole program in memory, but it could cut the disk I/O time down to just a couple of hours.

Can we do better than a modern general-purpose compressor if we know something about the structure of the data? Almost certainly. A good example is the PNG format. In theory, its compression is just a standard Deflate pass. But instead of compressing the raw data, the image is first transformed with PNG filters. A PNG filter is essentially a formula that predicts the value of a byte of raw data from the value of the same byte in the previous row and/or the previous pixel. For example, the "up" filter transforms each byte by subtracting the corresponding byte of the previous row from it during compression, and performs the inverse operation when decompressing. Given the kinds of images PNG is used for, the result almost always consists of zeros or numbers close to zero, and Deflate can compress such data much better than the raw data.

Can this principle be applied to the BFS state records? It seems like it should be possible. As with PNG, we have a fixed row size and can expect adjacent rows to be very similar. The first experiments with a subtract/add filter followed by zstd improved the compression ratio by another 40%: 0.87 bytes per state. The filtering operations are trivial, so in terms of CPU they are practically free.
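
As an illustration, here is a minimal sketch of such a subtraction filter in Python, followed by zstd (the actual solver is C++; the zstandard package and the 8-byte record size are assumptions made for the example):

import zstandard  # third-party zstd bindings, used here only for illustration

STATE_SIZE = 8  # bytes per packed state record (assumed)

def delta_filter(records: bytes) -> bytes:
    # PNG-"up"-style filter: replace each byte with its difference (mod 256)
    # from the same byte of the previous record, so that nearly identical
    # neighbouring records turn into long runs of zeros.
    out = bytearray(records[:STATE_SIZE])          # first record is kept as-is
    for i in range(STATE_SIZE, len(records)):
        out.append((records[i] - records[i - STATE_SIZE]) & 0xFF)
    return bytes(out)

def compress_sorted_states(records: bytes) -> bytes:
    return zstandard.ZstdCompressor(level=19).compress(delta_filter(records))

Unfiltering is the mirror image: add back the corresponding byte of the previously reconstructed record before handing the records to the search.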

It was not clear to me whether any further improvement was possible or whether this was the practical limit. With image data you can reasonably expect adjacent bytes within a row to be similar, but for these states that is not the case. Still, slightly more sophisticated filters do improve the results. In the end I settled on the following scheme:

Suppose we have adjacent rows R1 = [1, 2, 3, 4] and R2 = [1, 2, 6, 4]. When emitting R2, we compare each byte with the same byte of the previous row, writing 0 for a match and 1 for a mismatch: diff = [0, 0, 1, 0]. We then emit this bitmap encoded as a VarInt, followed by only the bytes that do not match the previous row. In this example we get two bytes: 0b00000100 and 6. By itself this filter compresses the reference data to 2.2 bytes/state, but combined with zstd it brings the data down to 0.42 bytes/state. In other words, 3.36 bits per state, which is only slightly above our rough budget for fitting all the data in RAM.
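
Here is a minimal sketch of that encoder in Python, assuming a LEB128-style VarInt for the bitmap (the real implementation in compress.h is C++ and may differ in its details):

def encode_row(prev: bytes, cur: bytes) -> bytes:
    # Bitmask with bit i set when byte i of cur differs from byte i of prev.
    mask = 0
    changed = bytearray()
    for i, (a, b) in enumerate(zip(prev, cur)):
        if a != b:
            mask |= 1 << i
            changed.append(b)
    # Encode the bitmask as a LEB128-style VarInt.
    out = bytearray()
    while True:
        byte = mask & 0x7F
        mask >>= 7
        if mask:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            break
    return bytes(out + changed)

# The worked example above: R1 = [1, 2, 3, 4], R2 = [1, 2, 6, 4]
print(encode_row(bytes([1, 2, 3, 4]), bytes([1, 2, 6, 4])))  # b'\x04\x06'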

In practice, the compression ratios get better as the sorted sets grow denser. By the time the search reaches the point where memory becomes a problem, the compression ratios have become much better. The largest puzzle ends up with 4.6 billion visited states. Sorted and compressed with the scheme above, they occupy 405 MB, which is 0.7 bits per state. Compression and decompression end up taking about 25% of the program's CPU time, but that is an excellent trade for a hundredfold reduction in memory use.

The filter above looks a bit wasteful because of the VarInt header on every row; it seems like it should be easy to improve at a small cost in CPU or complexity. I tried several variants: transposing the data into column-major order, writing the bitmasks in larger blocks, and so on. On their own these variants gave much better compression ratios, but they did worse once the filter output was fed through zstd. And it was not some quirk of zstd; the results with gzip and bzip2 were similar. I do not have any particularly clever theory for why this specific encoding compresses so much better than the alternatives.

Another mystery: the compression turned out much better when the data was sorted in little-endian order rather than big-endian. At first I assumed it was because little-endian sorting gives more leading zeros in the VarInt-encoded bitmask. But the difference persists even with filters that have no such dependency.

(There is plenty of research on compressing sorted sets of integers, since they are a basic building block of search engines. However, I did not find much about compressing sorted fixed-length records, and I did not want to shoehorn the problem by treating the data as arbitrary-precision integers.)

Uh-oh, I cheated!


You may have noticed that the BFS implementations in the pseudocode above return only a boolean: solution found / not found. That is not very useful. In most cases we need the exact list of steps in the solution, not just a report that it exists.

At first this seems easy to fix. Instead of collecting a set of states, collect a mapping from each state to its parent state. Then, after finding the solution, simply walk back through the parent chain from the end to the beginning. For the hash-table-based solution, it would look something like this:

def bfs(graph, start, end):
    visited = {start: None}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return trace_solution(node, visited)
        for kid in adjacent(node):
            if kid not in visited:
                visited[kid] = node
                todo.push_back(kid)
    return None

def trace_solution(state, visited):
    if state is None:
        return []
    return trace_solution(visited[state], visited) + [state]

Unfortunately, this destroys all the compression gains from the previous section. Those were based on the assumption that adjacent rows are very similar; when we look only at the states themselves, that holds, but there is no reason to expect it to hold for the parent states, which are effectively random data. Second, the sort + merge solution has to read and write every visited state on each iteration. To preserve the state/parent pairing, we would have to read and write all of this poorly compressed data from and to disk on every iteration.

Sort + merge with multiple outputs


At the very end, when tracing back the solution, the program only needs the state/parent-state pairs. So we can maintain two data structures in parallel. Visited is still the set of visited states, recomputed during each merge as before. Parents is a sorted list of state/parent-state pairs that is never rewritten; after each merge operation, the newly discovered "state + parent state" pairs are appended to parents.

def bfs(graph, start, end):
    parents = Stream()
    visited = Stream()
    todo = Stream()
    parents.add((start, None))
    visited.add(start)
    todo.add(start)
    while True:
        new = []
        for node in todo:
            if node == end:
                return trace_solution(node, parents)
            for kid in adjacent(node):
                new.push_back(kid)
        new_stream = Stream()
        for node in new.sorted().uniq():
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited, parents)
    return None

# Merges the sorted streams new and visited. New contains pairs of
# key + value (only the keys are compared), visited contains just
# keys.
#
# Returns a sorted stream of keys that were only present in new,
# and another sorted stream containing the keys that were present
# in either or both of new and visited. Also adds the key + value
# pairs to the parents stream for keys that were only present in new.
def merge_sorted_streams(new, visited, parents):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            visited_head = visited.peek()
            new_head = new.peek()[0]
            if visited_head == new_head:
                out_visited.add(visited.pop())
                new.pop()
            elif visited_head < new_head:
                out_visited.add(visited.pop())
            elif visited_head > new_head:
                out_todo.add(new_head)
                out_visited.add(new_head)
                parents.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek()[0])
            out_visited.add(new.peek()[0])
            parents.add(new.pop())
    return out_todo, out_visited

This gives us the best of both approaches in terms of runtime and working set, at the cost of using more secondary storage. Incidentally, a separate copy of the visited states grouped by depth will also turn out to be useful later, for other reasons.

Swap


Another detail the pseudocode glosses over: there is no explicit code for disk I/O, only the abstract Stream interface. A Stream could be a file stream or an in-memory array; we have been ignoring that implementation detail. Instead, the pseudocode concentrates on producing a memory access pattern that would use a disk optimally. In an ideal world that would be enough, and the OS virtual memory subsystem would take care of the rest.

But that is not what happens, at least on Linux. At one point (before the working set had been compressed down to the size of memory), I had the program running in about 11 hours with the data stored mostly in files on disk. I then switched the program to use anonymous pages instead of file-backed ones, and set up a sufficiently large swap file on the same disk. Three days later the program was only a quarter of the way through, and it was still getting slower over time. By my optimistic estimate, it would have finished in 20 days.

To be clear: it was the same code and exactly the same access pattern. The only thing that changed was that the memory was backed by swap rather than by an explicit file on disk. It is almost common knowledge that swapping destroys performance on Linux, while ordinary file I/O does not. I had always assumed it was because programs tend to treat RAM as truly random access memory. But that was not the case here.

It turns out that the virtual memory subsystem treats file-backed pages and anonymous pages differently. They are kept in separate LRU caches with different expiration policies, and they also seem to have different readahead/swap-in behaviour.

So now I know: swapping on Linux is likely to perform badly even under conditions that should be optimal for it. If parts of the address space are likely to be pushed out to disk for a while, it is better to store them in files manually than to trust swap. I did this by implementing my own vector class, which initially lives purely in memory and, after passing a size threshold, switches to an mmap of a separate temporary file.
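
As a rough illustration of that idea (a sketch only, not the actual C++ vector class), here is a Python byte buffer that spills into a temporary file once it crosses a size threshold:

import mmap
import os
import tempfile

class SpillingBuffer:
    # A byte buffer that lives in RAM until it exceeds `threshold` bytes,
    # then spills into a temporary file, so the kernel treats it as ordinary
    # file-backed memory instead of swap-backed anonymous memory.
    def __init__(self, threshold=1 << 30):
        self.threshold = threshold
        self.buf = bytearray()   # in-memory phase
        self.file = None         # file-backed phase
        self.size = 0

    def append(self, data: bytes):
        self.size += len(data)
        if self.file is None:
            if self.size <= self.threshold:
                self.buf += data
                return
            # Crossed the threshold: copy what we have into a temporary file.
            self.file = tempfile.TemporaryFile()
            self.file.write(self.buf)
            self.buf = bytearray()
        self.file.seek(0, os.SEEK_END)
        self.file.write(data)

    def view(self) -> memoryview:
        # Read-only view of the contents; mmap-backed once spilled.
        if self.file is None:
            return memoryview(self.buf)
        self.file.flush()
        m = mmap.mmap(self.file.fileno(), self.size, access=mmap.ACCESS_READ)
        return memoryview(m)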

Compressing new states before merging


In the simplified performance model we assumed 100 million new states per depth layer. That turned out to be not far from reality (the hardest puzzle peaked at slightly over 150 million unique new states on one depth layer). But unique states are not the right thing to measure: the working set before the merge consists not just of the unique states, but of all states output during that iteration. That figure peaks at 880 million output states for a single depth layer. These 880 million states a) have to be processed with a random access pattern for sorting, b) cannot be compressed effectively because they are not sorted, and c) have to be stored together with their parent state. That working set is roughly 16 GB.

The obvious solution is some form of external sorting: write all the states to disk, do an external sort, deduplicate, and then merge as usual. I used this approach at first, and while it solved problem A, it did nothing about B and C.

In the end, I took a different approach: collect the states into an in-memory array. When the array gets too large (say, more than 100 million elements), it is sorted, deduplicated, and compressed. This produces a batch of sorted runs of states; within each run there are no duplicates, but there can be duplicates across runs. The code that merges the new and visited states stays fundamentally the same: it is still based on walking through streams in lockstep. The only difference is that instead of walking just two streams, there is a separate stream for each sorted run of new states.

Of course, the compression ratio of these 100-million-state runs is not as good as that of the full set of visited states. But even so, it drastically reduces both the working set and the disk I/O requirements. There is a bit of extra CPU work to maintain a priority queue over the streams, but it is still a great trade-off.
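
A minimal sketch of that k-way merge in Python, using heapq.merge as the priority queue over the run streams (the toy lists below stand in for decompressed runs):

import heapq

def merge_runs(runs):
    # Merge several sorted, individually deduplicated runs into one sorted,
    # fully deduplicated stream; duplicates can still occur across runs.
    previous = object()              # sentinel unequal to any real state
    for state in heapq.merge(*runs):
        if state != previous:
            yield state
            previous = state

# Three sorted runs with duplicates across (but not within) runs.
runs = [[1, 4, 7], [2, 4, 9], [4, 5, 7]]
print(list(merge_runs(runs)))        # [1, 2, 4, 5, 7, 9]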

Saving space on parent states


At this stage, the vast majority of the space used by the program goes to storing the parent states, so that the solution path can be reconstructed once the goal is found. They are unlikely to compress well, but maybe there is a CPU/memory trade-off to be had?

We need to connect a state S' at depth D+1 with its parent state S at depth D. If we could iterate over all possible parent states of S', we could check which of them appear at depth D in the visited set. (We already have the visited set grouped by depth as a convenient by-product of emitting the state/parent pairs during the merge.) Unfortunately, that approach does not work for this problem: it is simply too hard to generate all possible predecessor states S for a given S'. For many other search problems, though, such a solution might work.

If we can only generate state transitions forward, not backward, why not just do that? Iterate over all the states at depth D and see what output states they produce. If some state produces S' as output, we have found a suitable S. The problem with this plan is that it increases the program's total CPU usage by 50%. (Not 100%, because on average we find S after looking at half the states at depth D.)

So I did not like either extreme, but at least there is room for a CPU/memory trade-off in between. Is there a more palatable solution somewhere in the middle? In the end, I decided to store not the pair (S', S) but the pair (S', H(S)), where H is an 8-bit hash function. To find S for a given S', we again iterate over all the states at depth D. But before doing anything else, we compute the same hash; if it does not match H(S), this cannot be the state we are looking for, and we can simply skip it. This means the expensive recomputation only has to be done for roughly 1/256 of the states, a slight increase in CPU load, while the memory for storing a parent state drops from 8-10 bytes to 1 byte.
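
A sketch of that trade-off in Python (hash8, depth_d_states and successors are hypothetical names standing in for the solver's real state type and move generator):

def hash8(state) -> int:
    # Any cheap 8-bit hash works; for real use it must be stable across runs.
    return hash(state) & 0xFF

# When recording parents we store (child, hash8(parent)) instead of
# (child, parent): one byte per parent instead of 8-10.

def find_parent(child, parent_hash, depth_d_states, successors):
    # Recover the actual parent of `child` by re-expanding the states at
    # depth D, skipping any candidate whose 8-bit hash cannot match.
    for s in depth_d_states:
        if hash8(s) != parent_hash:
            continue                  # ~255/256 of candidates stop here
        if child in successors(s):    # expensive move generation, now rare
            return s
    return None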

What did not work or may not work


The previous sections covered the sequence of high-level optimizations that worked. There were other things I tried that did not work, and things I found in the literature but decided would not work in this particular case. Here is a partial list.

At one point I did not recompute the entire visited set on every iteration. Instead, it was kept as multiple sorted runs, and those runs were compacted only occasionally. The advantage is fewer disk writes and less CPU spent on compression; the downside is more code complexity and a worse compression ratio. Originally I thought the scheme made sense because in my setup writes are more expensive than reads, but in the end the compressed data came out about twice as large. The benefits of that trade were far from obvious, so I eventually went back to the simpler form.

Quite a bit of research has been done on running large breadth-first searches over implicitly defined graphs in secondary storage; this 2008 article is a good starting point. As you might guess, the idea of deduplicating with sort + merge in secondary storage is not new. What is surprising is that it was only discovered in 1993. Pretty late! There are also later proposals for breadth-first search in secondary storage that do not require a sorting step.

One of them was to map states to integers and keep an in-memory bitmap of visited states. That is completely useless in my case, since the encodable state space is vastly larger than the set of actually reachable states. And I rather doubt there are interesting problems where this approach would work.

Another serious alternative is based on temporary hash tables. The visited states are stored in a file, unsorted. The outputs generated at depth D are stored in a hash table. We then iterate through the visited states and look each one up in the hash table; if it is found there, it is deleted from the hash table. After the whole file has been traversed, only the non-duplicate elements remain in the table. They are then appended to the file and used to initialize the todo list for the next iteration. If the amount of output is so large that the hash table does not fit in memory, both the file and the hash table can be partitioned by the same criterion (for example, the top bits of the state), and each partition processed separately.

Although there are benchmarks showing the hash-based approach to be about 30% faster than sort + merge, they do not appear to account for compression. I simply could not see how giving up the benefits of compression could pay off, so I did not even experiment with these approaches.

Another research area worth a look was database query optimization. It seems that the deduplication problem is closely related to database joins, which face exactly the same sort-versus-hash dilemma. Obviously, some of that research should carry over to this search problem. One difference may be that the output of a database join is transient, while the output of BFS deduplication is kept until the end of the computation. That seems to shift the balance of the trade-offs: it is no longer only about processing one iteration as efficiently as possible, but also about producing the output in the most useful format for the next iteration.

Conclusion


This concludes my account of the lessons from this project that are generally applicable to other brute-force search problems. Together, these tricks reduced the memory needed to solve the game's hardest puzzles from 50-100 GB to 500 MB, and let the cost degrade gracefully when the problem exceeds available memory and spills to disk. In addition, my solver is about 50% faster than naive hash-table-based state deduplication, even on puzzles that fit in memory.

Snakebird can be purchased on Steam, Google Play, and the App Store. I recommend it to anyone interested in very difficult but fair puzzles.
