The Fastest Indian: a Trie Key/Value Container
“It might seem like I'm not doing anything. But actually, at the cellular level, I'm very busy.”
Author unknown
In the 21st century, building programs increasingly resembles assembling a Lego set. This approach implies that many of the "bricks" were invented before us. Their elementary nature deceptively suggests that the potential for improvement was exhausted years ago, and that all that remains is to use what we have. But, oddly enough, by analogy with biology, it is precisely these elementary "cells" that sometimes hide the most complex and carefully thought-out algorithms, and it is here that the most interesting battles are fought. In this sense, programmers resemble physicians in the sheer breadth of the profession. There are general practitioners, veterinarians, and surgeons, and then there are those who can spend several months of work on a few lines of code.
“At Google, right now, as I speak, 1% of all CPUs in our server fleet are busy computing inside hash tables. As I speak, hash tables occupy more than 8% of all server RAM. And that is just the C++ side; I don't know the situation in Java.”
Matt Kulukundis, CppCon 2017
At one point, an interesting idea came to my mind: it should be possible to place two sparse pages in a single block of memory, while encoding the key values themselves through jumps inside the container. The idea seemed fresh and interesting, and checking it took only a few dozen lines of code, so the next evening curiosity got the better of me and I set out to find out how much memory pages could be compressed this way. I must say that this was only the beginning, and over time everything turned into testing a huge number of hypotheses, measurements, and versions. In the current version it is already hard to discern the outlines of that original "ideal" model. As in any typical engineering task, it is very important to find a balance here. It is difficult to find that simple, universal mechanism which, like a rifle bolt, will work equally well in dirt, water, and heat.
“Making something simple is sometimes many times harder than making something complex.”
Mikhail Kalashnikov
The idea of building a Trie that runs faster than hash tables is not new. In 2001, Douglas Baskins described the working principle of the Judy Array. The idea looked so promising that Hewlett-Packard assigned a whole group of engineers to the project. A well-optimized Trie resembles a small operating system: it has its own memory manager, a whole list of compression and balancing algorithms for nodes, its own little ecosystem for managing the microworld inside the container. Because the implementation is so complex, very few such projects exist in the public domain. The vast majority of Trie implementations in open repositories are only a few hundred lines of code (for comparison, HP JudyArray is about 20K LOC and HArray about 8K LOC). Those small implementations are mostly academic in nature, or were created to solve specific text-processing problems. You are unlikely to be asked about Trie in an interview, and such Key/Value containers are not included in the standard libraries of popular programming languages. And, I must say, completely in vain. Even a first examination leads to the understanding that a well-optimized Trie can run faster than hash tables while being even richer in functionality than binary trees.
Information retrieval is one of the fundamental tasks of computing. Practically as soon as computers learned to read and store information, the need for efficient search arose. In all this time, only three basic concepts for organizing fast lookup have been proposed. Binary trees: a way of organizing a container in which the keys must be kept strictly sorted. Hash tables: the address of a value is obtained by processing the key with a hash function. And Trie: the key itself encodes the path to the value. The literature covers in detail how these algorithms work; there are, for example, graphical descriptions of how these approaches differ from one another.
But little is said about why Trie is the most interesting of the three as the basis for a universal Key/Value container.
Hashtable
This structure was designed for very fast key lookup, sacrificing practically everything else. At the forefront is the hash function, on whose quality almost everything depends. You can choose an excellent hash function that gives an almost uniform distribution on test data and still not control the situation 100%: in real work, on new data, your hash function may degenerate into something close to the worst case, and the super-fast O(1) lookup turns into something like a full scan, O(N). Another problem: the more data, the more collisions. On large volumes collisions grow like a snowball, and at some point the entire hash table has to be rebuilt. This is called a Stop the World event, and it means you cannot guarantee near-constant latency to the calling code. A large amount of data also entails nonlinear, excessive memory consumption: many cells inside the hash table will be empty, and the keys themselves are stored in the buckets uncompressed. There is no efficient way to search by range or by key pattern in such a container. Even if two keys differ in only the last bit, the probability that they end up close together in the same bucket tends to zero. For the processor, working with a natural data set where the keys resemble one another (URLs, file paths, words, etc.), this often means a cache miss, which does not add performance either. And even with an ideal hash function, only a few keys in the container, and no collisions at all, every insert and lookup scans the key itself at least twice: the first time when passing it through the hash function, and the second time, after arriving at the address, when checking that the found key really is the one required. Trie is free of almost all of these shortcomings, but more on that below.
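The "scan the key at least twice" point can be illustrated with a minimal sketch of a chained hash table. This is a toy with a fixed bucket count and no rehashing, not how any production table is laid out; the names are invented for illustration:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal chained hash table. Every insert and lookup passes over the key
// bytes twice: once inside the hash function, and once in the equality
// check against the stored key.
struct TinyHashMap {
    static const size_t kBuckets = 16;  // fixed size: no rehashing in this sketch
    std::vector<std::pair<std::string, int>> buckets[kBuckets];

    void insert(const std::string& key, int value) {
        size_t b = std::hash<std::string>{}(key) % kBuckets;  // pass 1 over the key
        for (auto& kv : buckets[b])
            if (kv.first == key) { kv.second = value; return; }  // pass 2 over the key
        buckets[b].push_back({key, value});
    }

    const int* find(const std::string& key) const {
        size_t b = std::hash<std::string>{}(key) % kBuckets;  // pass 1 over the key
        for (const auto& kv : buckets[b])
            if (kv.first == key) return &kv.second;           // pass 2 over the key
        return nullptr;
    }
};
```

Note also that the keys sit in the buckets verbatim: "apple" and "apricot" share no storage, no matter how similar they are.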
Binary tree
Something of a gold standard, especially in databases, are the various flavors of binary search. Two modifications are the most common: the Red-Black Tree, a binary search tree with efficient balancing, and the B+ Tree, when additional work with the disk is needed. Binary search lacks many of the disadvantages of hash tables. No hash function has to be chosen, memory consumption is almost linear, inserts and lookups take predictable time, usually O(log(n)). It supports searching over a range of keys and lets you define the sort order. But the lookup itself is rather slow: in a container with about a million keys, finding a key takes about 20 seeks. And while in an ideal hash table with no collisions we talked about scanning the key twice, here the key may be scanned dozens of times, at every step where keys must be compared with each other for less/greater/equal. In general, the binary tree works really well, but unfortunately not on a flat memory model. Every time a new key is inserted, it goes somewhere into the middle to preserve the sort order. Because of this the balancing algorithms are quite complicated, and because of this gaps are left in extents to avoid moving old data on every insert.
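The "about 20 seeks for a million keys" figure is simply log2(10^6) ≈ 20. It can be checked with a plain binary search over a sorted array that counts its own probes; each probe stands in for one tree-node visit (and one full or partial key comparison):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of probes a lower-bound binary search makes before
// converging. For n keys this is about log2(n): ~20 probes for a million
// keys, which in a real tree means ~20 node visits and key comparisons.
int probes_for(const std::vector<int>& v, int target) {
    int probes = 0;
    size_t lo = 0, hi = v.size();
    while (lo < hi) {
        ++probes;
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid] < target) lo = mid + 1;
        else                 hi = mid;
    }
    return probes;  // v[lo] is the first element >= target
}
```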
Trie
Here we return to our "dark horse", and the first thing to say is that a Trie always scans the key exactly once. In essence this means that it is the Trie that, at least in theory, can work faster than hash tables, let alone binary trees. Another interesting property follows from this: the longer the keys, the greater the speed advantage of the Trie over hash tables. In addition, this structure is very friendly to processor caches: firstly, similar keys are located close together in memory, and secondly, such keys most often share common memory segments, so on the next insert or lookup part of the key is likely to already be loaded into the processor's L1/L2 caches. If the keys are of a similar nature, such as URLs, the structure also saves memory thanks to prefix compression. Such keys are scanned more efficiently over a range as well: a binary tree reads each key from beginning to end, while a Trie scans only the "tails" of the keys. Obviously, a Trie also behaves better on a flat memory model, since it does not require constant balancing, unlike binary trees, and does not require a complete rebuild on large volumes of data, unlike hash tables. There is no Stop the World event.
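A minimal, uncompressed trie makes both properties visible: each byte of the key selects the next child node, so insert and lookup scan the key exactly once, and keys with a common prefix physically share the same nodes. This sketch has none of the node-compression machinery an optimized implementation like HArray uses; it only shows the core idea:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// One node per key byte. Children are kept in a std::map purely for
// simplicity; optimized tries use compressed node layouts instead.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool hasValue = false;
    int value = 0;
};

struct Trie {
    TrieNode root;

    void insert(const std::string& key, int v) {
        TrieNode* n = &root;
        for (char c : key) {                       // single pass over the key
            auto& child = n->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            n = child.get();
        }
        n->hasValue = true;
        n->value = v;
    }

    const int* find(const std::string& key) const {
        const TrieNode* n = &root;
        for (char c : key) {                       // single pass over the key
            auto it = n->children.find(c);
            if (it == n->children.end()) return nullptr;
            n = it->second.get();
        }
        return n->hasValue ? &n->value : nullptr;
    }
};
```

Two URLs like "http://a.com" and "http://a.org" stored in this structure share one chain of nodes for the "http://a.o"-less common prefix; only their tails occupy separate memory.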
What are the disadvantages? The first is that, despite various node compression techniques such as Patricia, this structure is very fond of long jumps. If the data is stored on an HDD, where a seek is a very expensive operation, a hash table will work much faster: even though it scans the key twice or more, it needs only one seek to position itself on the desired bucket, while a Trie needs several, albeit on average fewer than a classic binary tree. For the same reason, scanning keys of a random nature over a range is much more efficient in a binary tree, since walking a Trie subtree involves many jumps. Another disadvantage is the complexity of implementing such a structure, and the inability to specify a custom sort order for keys, although the latter is not true of all implementations; HArray, for example, still supports custom key sorting.
JSON
In the previous examples we compared containers by parameters such as speed, memory, and the ability to scan over a range of keys. For most applications, replacing one container implementation with another means only a minor optimization. In heavily loaded projects the gain will of course be substantially greater. And by heavily loaded I mean not only the servers of large corporations handling masses of client requests. Data archiving, graphics rendering, content indexing: these are all tasks where an ordinary key/value container can work inside the "engine" under a load of millions of requests per second. In such scenarios speed is never superfluous, and an optimization of, say, a factor of two means that 16 GB of content is indexed not in 4 hours but in 2. Even so, in all these cases we have a choice: we can use an existing key/value container implementation and not think much about the details of how it works. However, there is a whole class of tasks where using anything other than a Trie is simply impractical. These are a number of text-processing tasks, for example the way the Suffix Tree works. Likewise, the Radix Tree has found a successful application inside the Linux kernel and could hardly be replaced by anything else. All these examples are well described in the literature, so I will not dwell on them in more detail. Instead, let me give another interesting example. In application architecture it is very important to achieve uniformity, and it often happens that a properly chosen abstraction fits everything else like a template. JSON is a natural, intuitive format, and it can be stored in a Trie in the same natural, intuitive way. How? Simply enough: slice the JSON into keys, where the Key is the path to an attribute plus its value, and the Value is the document number.
Thus, inserting, updating, or deleting an attribute in the middle of a JSON document does not mean rewriting the whole document, but only inserting, updating, or deleting keys inside the container. Finding any attribute, or a range of attribute values, means only looking up keys or scanning a subtree inside the Trie, without deserializing the entire document. All these operations are very fast; removing a key from a Key/Value container usually costs less than a hundred nanoseconds. In addition, the Trie naturally compresses JSON through an inverted index. If such documents were stored as separate files, the attributes would be duplicated in each of them, but once they are added to the Trie, every attribute is represented exactly once, regardless of how many documents are in the container. This approach is somewhat reminiscent of the one used in column-oriented data warehouses, but applied this time to document-oriented databases. In general, this topic deserves a separate article.
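The slicing scheme above can be sketched as follows. Here std::map stands in for the trie (any sorted key/value container with prefix scans works, though without the trie's prefix compression), and the "path=value" key format and attribute names are invented for illustration:

```cpp
#include <cassert>
#include <map>
#include <string>

// Sorted key/value index emulating the "JSON sliced into keys" scheme:
// key = attribute path plus its value, value = document number.
using JsonIndex = std::map<std::string, int>;

// Index one already-flattened attribute of a document.
void addAttribute(JsonIndex& index, const std::string& path,
                  const std::string& value, int docId) {
    index[path + "=" + value] = docId;
}

// Scan the "subtree" under one attribute prefix, i.e. enumerate that
// attribute's values across all documents, without touching whole documents.
int countWithPrefix(const JsonIndex& index, const std::string& prefix) {
    int n = 0;
    for (auto it = index.lower_bound(prefix);
         it != index.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it)
        ++n;
    return n;
}
```

With two documents like {"user":{"name":"Alice","age":30}} and {"user":{"name":"Bob","age":25}} indexed this way, updating one attribute is one key replacement, and a query over all values of "user.name" is a prefix scan; a real trie would additionally store the shared "user.name=" path bytes only once.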
“A theory without practice is dead and fruitless, and practice without theory is useless and harmful”
P. L. Chebyshev
Recently the HArray project, an implementation of the Trie algorithm with many optimizations, was published as open source. The implementation turned out to be quite complicated but, in my opinion, quite effective, which the benchmarks confirm. This version is not final; there are many more ideas for improvement. Moreover, the old version worked one and a half times faster, but slowed down noticeably after new functionality was added, and that is also worth sorting out.
Beyond the basic functionality, honest deletion is implemented. "Honest" means that keys are not simply "nulled out" but are dismantled step by step, gradually freeing memory. Under massive addition and removal of keys the structure operates in a "zero-waste" mode: parts of old "dead" keys are reused to build new ones. If deletions outnumber additions of new keys, at some point the structure can return surplus memory pages to the operating system, while defragmenting the pages it still uses. All kinds of tree traversals, search by key pattern, saving to disk, the ability to override the sort order of keys in the container, and other functionality are also implemented.