How we sped up a query 100 times, or not all hash functions are equally bad

    We are developing a database. One day, a company approached us with the following task:

    There is a set of objects and a set of tags. Each object can carry several tags; some tags are very rare, others very common, and the same tag can be attached to the same object multiple times.
    New objects, tags, and links between them are added continuously.
    The task is to quickly answer questions like “how many objects have tag A or tag B, but not tag C”, and similar. Such queries should be answered in tenths of a second, without pausing data ingestion.

    We received their data as of the current day, deployed a test cluster of four machines, and began to think about how to distribute the data and how to express the task as an SQL query to get maximum performance. We ended up deciding the query could look like this:

    SELECT 
        COUNT(*) 
    FROM (
        SELECT 
            object_id, 
            (MAX(tag == A) OR MAX(tag == B)) AND MIN(tag != C) AS good
        FROM tags
        WHERE tag IN (A, B, C)
        GROUP BY object_id
    ) WHERE good == 1;
    


    In order for such a query to run quickly, we sharded the data across the cluster servers by object_id and sorted it by tag within each server. The server executing the query can thus forward it unchanged to all data servers and then simply sum their results. To execute the query, each data server only has to locate the rows for tags A, B, and C (a fast operation, since the data is sorted by tag) and then evaluate the query in a single pass over those rows. The most common tag has several tens of millions of objects, and processing several tens of millions of rows in tenths of a second looked feasible.
    It is worth noting that the subquery contains GROUP BY object_id. GROUP BY can be executed in several ways; for example, if the data within each tag were sorted by object_id, something similar to a merge could be used. In our case the data was not sorted by object_id, so the optimizer reasonably decided to build a hash table to execute the GROUP BY.
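    For illustration, here is a minimal sketch of what such a hash-aggregation plan computes on a single data server; the row layout and the tag constants are hypothetical stand-ins, not our actual storage format:

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    // One (object_id, tag) row; the real storage is sorted by tag, so the
    // rows for tags A, B and C are found with a quick range lookup.
    struct Row { uint64_t object_id; uint64_t tag; };

    // Hypothetical constants standing in for the tags A, B and C.
    constexpr uint64_t TAG_A = 1, TAG_B = 2, TAG_C = 3;

    uint64_t countGoodObjects(const std::vector<Row>& rows) {
        struct Flags { bool a_or_b = false; bool c = false; };
        // The GROUP BY object_id: one hash-table entry per object.
        std::unordered_map<uint64_t, Flags> groups;
        for (const Row& r : rows) {               // rows already filtered to A, B, C
            Flags& f = groups[r.object_id];
            if (r.tag == TAG_A || r.tag == TAG_B) f.a_or_b = true;
            if (r.tag == TAG_C)                   f.c = true;
        }
        uint64_t count = 0;
        for (const auto& kv : groups)
            if (kv.second.a_or_b && !kv.second.c) ++count;  // (A or B) and not C
        return count;
    }

    int main() {
        std::vector<Row> rows = {{10, TAG_A}, {10, TAG_C}, {20, TAG_B}};
        std::printf("%llu\n", (unsigned long long)countGoodObjects(rows)); // 1: only object 20
    }

    The coordinating server then only has to add up the counts returned by the data servers.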

    We loaded all the data into the cluster and ran the query. It took 25 seconds.

    Something went wrong.
    First of all, we removed the GROUP BY object_id to localize the problem. The query finished in the expected 0.3 seconds, so reading the data was fast enough and nothing extra was being sent over the network. The problem had to be somewhere in the hash table.
    We deployed a debug build of the server and started writing various statistics to the log. After a while it became clear that some hash values occurred in the table far more often than others: even though the number of buckets was greater than the number of entries, the longest chain was about 32 entries long. So something was wrong with the hash function. Opening the implementation, we saw something like the following:

    #include <cstdint>

    // Per-value hash; the constants are elided here, as in the original.
    uint64_t hash(uint64_t value) {
        return value * MULTIPLIER1;
    }

    // Mixes the next value into an accumulated hash.
    uint64_t accumulateHash(uint64_t hash, uint64_t value) {
        hash ^= hash >> SHIFT1;   // the two XOR-shift lines in question
        hash += value;
        hash ^= hash >> SHIFT2;
        hash *= MULTIPLIER2;
        return hash;
    }
    


    The bit twiddling looked very suspicious. Whoever wrote this code clearly thought that shuffling the bits around would make the function better, but it seemed overdone. We commented out both lines with the XOR and shift and restarted the cluster. The query finished in 0.5 seconds. Victory.

    The next morning I decided to find out who had originally written those two lines. I ran git blame and found the ill-fated commit. Interestingly, those two lines were the only change this commit made to the server code; before it, accumulateHash consisted of just the addition and the multiplication. But besides adding the two lines, the author had also added a so-called avalanche test: a test verifying that any change in the input, even a single flipped bit, leads to a completely random-looking change in the hash value. The test was based on a publication by some candidate of sciences and looked reasonable. And indeed, the test failed without the XOR and shift but passed with them. So the author had not simply written a confusing function; he understood what he was doing. Why, then, did the function behave so unpredictably in practice?
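    The commit's actual test is not shown here, but a minimal sketch of an avalanche test could look like this; the shifts and the multiplier are stand-ins, not the constants from the commit. For a good hash, flipping any single input bit should flip each output bit with probability close to one half:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <random>

    // Stand-in mixing function in the shape of accumulateHash; the real
    // SHIFT1, SHIFT2 and MULTIPLIER2 are not shown in the story.
    uint64_t mix(uint64_t h, uint64_t value) {
        h ^= h >> 23;
        h += value;
        h ^= h >> 29;
        h *= 0x2127599bf4325c37ULL;
        return h;
    }

    int main() {
        const int trials = 10000;
        static int flips[64][64] = {};   // flips[input bit][output bit]
        std::mt19937_64 rng(1);
        for (int t = 0; t < trials; ++t) {
            uint64_t x = rng();
            uint64_t base = mix(0, x);
            for (int in = 0; in < 64; ++in) {
                // Flip one input bit and record which output bits change.
                uint64_t diff = base ^ mix(0, x ^ (1ULL << in));
                for (int out = 0; out < 64; ++out)
                    flips[in][out] += (diff >> out) & 1;
            }
        }
        // For a good hash every counter is close to trials / 2.
        double worst = 0;
        for (int in = 0; in < 64; ++in)
            for (int out = 0; out < 64; ++out)
                worst = std::max(worst, std::fabs((double)flips[in][out] / trials - 0.5));
        std::printf("worst bias: %.4f (0 is ideal, 0.5 is worst)\n", worst);
    }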

    To get to the bottom of this, I downloaded the data for one of the tags to my local machine and began to experiment. Our current hash function did indeed produce collisions: all the hash values had the same high five bits. Yet the very same function with different SHIFT1 and SHIFT2 gave good results. Then I generated 10 million random numbers and built the hash table again with our bad hash function; this time there were no abnormal collisions either. So on random numbers the function behaved well, and the problem lay somewhere at the intersection of our hash function and the user data. I started looking for patterns in the user data and found them: the ids are all divisible by 64, and they all share the same high five bits. I generated a set of random multiples of 64 with the same high five bits. Still no collisions. What else could it be? Could the way the objects' ids are generated somehow defeat our hash function?
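    That last experiment, roughly; the multiplier is a stand-in, and the point is only the shape of the synthetic ids:

    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <unordered_set>

    // Stand-in for hash(object_id); the real constant is not shown.
    uint64_t hashId(uint64_t id) { return id * 0x9e3779b97f4a7c15ULL; }

    int main() {
        std::mt19937_64 rng(42);
        std::unordered_set<uint64_t> top;   // distinct high-five-bit patterns of the hash
        for (int i = 0; i < 10000000; ++i) {
            uint64_t id = (rng() >> 11) & ~63ULL;   // a random multiple of 64...
            id |= 0x1FULL << 59;                    // ...with fixed high five bits
            top.insert(hashId(id) >> 59);
        }
        // The hashes still spread over all 32 high-bit patterns: this pattern
        // in the ids alone does not break the function.
        std::printf("distinct high-5-bit patterns: %zu of 32\n", top.size());
    }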

    The next day I had decided to ask the client exactly how they generate these ids, and I was almost ready to go home when a colleague asked me: this column you GROUP BY, isn't it by any chance the same column the data is sharded by across the servers?
    Yes, it was indeed the same column. It turned out that, due to a bug in the code, the same hash function was used both for partitioning the data and for the GROUP BY. Each of the four physical servers held 8 partitions, 32 partitions in total, and the partition is selected by the highest bits of the hash. As a result, for all rows within one partition, the highest five bits of the hash of object_id are the same. When we commented out the XOR and shift and restarted the cluster, we did not reload the data; the data remained partitioned by a hash different from the one now used for GROUP BY, so the problem only appeared to be fixed. Had we kept loading new data, it would sooner or later have made itself felt again.
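    The effect is easy to reproduce. A sketch with a stand-in hash, where both the partition and the hash-table bucket are taken from the high bits of the same hash value:

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Stand-in for the shared hash function.
    uint64_t hashId(uint64_t id) { return id * 0x9e3779b97f4a7c15ULL; }

    int main() {
        const size_t kBuckets = 1 << 20;        // per-partition GROUP BY table
        std::vector<uint32_t> chain(kBuckets, 0);
        // Keep only the ids routed to partition 0 of 32 (top five bits of
        // the hash), which is exactly what one data server sees.
        for (uint64_t id = 0; id < 30000000; ++id) {
            uint64_t h = hashId(id);
            if (h >> 59 != 0) continue;         // partition = top 5 bits
            ++chain[h >> 44];                   // bucket = top 20 bits
        }
        uint32_t longest = 0;
        for (uint32_t c : chain) longest = std::max(longest, c);
        // With the top five bits fixed inside a partition, only 1/32 of the
        // buckets are ever used, and chains grow about 32 times longer.
        std::printf("longest chain: %u\n", longest);
    }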

    We returned the hash function to the form in which it passes the avalanche test and changed the hash function used for partitioning. Afterwards one of our colleagues shared a similar story: at a programming competition he built a hash table in Java keyed by two-dimensional coordinates. Each coordinate was a 32-bit number, and to make the code fast he packed the pair into a single 64-bit number and used it as the key. Unfortunately, the solution did not fit in the time limit, because Java's hash function for a 64-bit value XORs its upper and lower 32 bits, and therefore returns the same value, zero, for every coordinate pair with X == Y.
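    Java's Long.hashCode() computes (int)(value ^ (value >>> 32)); here is that fold written out in C++, to keep one language throughout:

    #include <cstdint>
    #include <cstdio>

    // The fold Java's Long.hashCode() performs: XOR of the upper and
    // lower 32 bits of the value.
    uint32_t foldHash(uint64_t v) { return (uint32_t)(v ^ (v >> 32)); }

    int main() {
        for (uint32_t x : {1u, 7u, 12345u}) {
            uint64_t key = ((uint64_t)x << 32) | x;               // pack (x, y) with x == y
            std::printf("x=y=%u -> hash=%u\n", x, foldHash(key)); // always 0
        }
    }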
