Python prefix trees
I recently finished datrie, a Python library that implements a prefix tree (see Wikipedia or Habr), and I hasten to share it.
In short, datrie.Trie can be seen as a replacement for the standard Python dict that, under certain conditions (the keys are strings), takes less memory, retrieves a single element at comparable speed, and supports extra operations (finding all prefixes of a given string, finding all keys that start with a given string, etc.) which run about as fast as ordinary dictionary operations.
It works under Python 2.6-3.3, supports Unicode, and is LGPL-licensed.
datrie is a Cython wrapper around the libdatrie library. libdatrie implements a variant of the prefix tree in which the state machine for transitions between nodes is stored in two special arrays, plus suffix compression: chains of nodes that do not branch are kept in a separate array of "tails". This is not quite the state of the art among tries (a HAT-trie and the like should be faster), but it is fast and memory-efficient and comes with a good ready-made implementation (and in practice the quality of the implementation can outweigh the choice of algorithm).
The existing trie-like structures for Python did not suit me. Pure-Python implementations inevitably eat a lot of memory, so they were ruled out immediately. The other trie-like implementations for Python:
- trie.c in Biopython: does not support Unicode (which is OK in itself) and does not work under 3.x (making it work under 3.x would mean substantially rewriting the wrapper, because the library is implemented as a "bare" C extension);
- github.com/buriy/python-chartrie by Buriy: quite fast (actually faster than datrie on __getitem__), but it takes a lot of memory (fixing that would probably require changing the data structure the library is built on), offers little functionality, and does not support Unicode (which, again, is OK in itself);
- www.dalkescientific.com/Python/PyJudy.html: old and complex, and it is unclear whether it still works (the page says it supports Python 2.3 and 2.4).
Maybe there is something else, but in any case I did not find an "install it and it just works" option, so I figure my desire to brush up on C and get to know Cython and Python better was nicely covered by the lack of a decent C-backed trie implementation for Python.
Installation (as is usually the case when installing extensions, you need a compiler):
pip install datrie
Creating:
import string
import datrie
trie = datrie.Trie(string.ascii_lowercase)
At creation time you must say up front which keys can be used with this trie (by explicitly specifying the alphabet or character ranges). This is a limitation of libdatrie that lets it store the state machine efficiently while still supporting Unicode: the valid ranges of Unicode characters are declared first, and Unicode keys are then translated into a more compact internal representation.
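For non-ASCII keys the alphabet can be given as explicit character ranges instead; the snippet below is a sketch assuming the ranges keyword argument of the constructor (the specific ranges are just an illustration):
import datrie
# allow keys built from lowercase Latin and lowercase Russian letters,
# declared as Unicode character ranges rather than a flat alphabet string
trie = datrie.Trie(ranges=[(u'a', u'z'), (u'а', u'я')])
trie[u'привет'] = 1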
It seems to me that in practice one could get by with single-byte encodings like cp1251 and have roughly the same functionality and efficiency, but the character-range approach works well too; that is how libdatrie does it, and fair enough. This is also why I write above that "does not support Unicode" is OK: in the case of a trie, a single-byte encoding can be a perfectly convenient option.
After that you can work with the trie like a dictionary:
>>> trie[u'foo'] = 5
>>> trie[u'foobar'] = 10
>>> trie[u'bar'] = 'bar value'
>>> trie.setdefault(u'foobar', 15)
10
>>> u'foo' in trie
True
>>> trie[u'foo']
5
The keys must be unicode :) In Python 3.x this is quite natural; in 2.x you have to prepend the letter u in the examples, sorry. Values can be arbitrary Python objects, which is unusual for C trie implementations (values are normally just integers). In fact, "inside" the values really are integers, but datrie.Trie uses them as indices into an array of the "real" values. For those who do not need this feature (for example, when the values are of no interest at all), datrie provides datrie.BaseTrie, which is a little faster and can only store numbers.
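A quick sketch of the difference (assuming BaseTrie takes the same alphabet argument as Trie):
import string
import datrie

trie = datrie.Trie(string.ascii_lowercase)
trie[u'foo'] = {'any': u'python object'}  # stored via an index into a value array

base = datrie.BaseTrie(string.ascii_lowercase)
base[u'foo'] = 42  # BaseTrie values must be plain integers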
A little about speed. All measurements were taken on a datrie.Trie holding a hundred thousand unique Russian and English words (50/50) with the int value 1; other measurements (on a million URLs) can be viewed here, or (even better) run on your own data. I took the speed measurements only to get a general feel for the orders of magnitude and to catch regressions in the library, so treat them accordingly; all the source code and data are in the repository. I also will not say anything about the asymptotic complexity of the various operations, because I have not investigated it. In theory it should match the trie from Wikipedia (for example, getting an element or searching by prefix is O(m), where m is the key length), but implementation details (of both libdatrie and my wrapper) can change everything; if someone plots the corresponding graphs, I will of course be grateful.
Getting an element, membership testing, and updating an existing element are on average 2-3 times slower than with a standard dictionary (which is still fast: on my laptop it works out to 1 to 3 million operations per second for the aforementioned trie). The exception is inserting a new value into the trie, which is much slower (about 50 thousand operations per second for the same trie of Russian and English words). At the same time, on such data the trie takes up significantly less RAM: 3-5 MB (depending on the interpreter) versus 20+ MB for a regular dictionary (the memory measurements were crude, so I cannot vouch for the exact numbers).
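To get comparable numbers on your own data, a micro-benchmark along these lines is enough (a minimal sketch, not the repository's actual benchmark; the tiny word list here is made up):
import string
import timeit
import datrie

trie = datrie.Trie(string.ascii_lowercase)
d = {}
for word in [u'foo', u'bar', u'foobar']:  # the real benchmark uses 100k words
    trie[word] = 1
    d[word] = 1

# time repeated lookups: the trie is expected to be 2-3x slower than dict
print(timeit.timeit(lambda: trie[u'foobar'], number=100000))
print(timeit.timeit(lambda: d[u'foobar'], number=100000))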
That is, datrie.Trie can be used as a dict replacement when there is a large number of not-very-long strings (for example, words or URLs), the data is used mostly in read-only (or "update-only") mode, and you want to save RAM at the cost of 2-3 times slower access.
To make this pseudo-read-only mode less of a nuisance, a trie can be saved to a file and loaded back:
>>> trie.save('my.trie')
>>> trie2 = datrie.Trie.load('my.trie')
Another nice property of a trie is that some operations which, in a dict (or any other hash table), would require a full scan or a lot of memory for extra indexes run in a prefix tree almost as fast as a plain value lookup (and some even faster).
You can check whether the trie contains elements whose keys start with a given prefix (this is even faster than a plain in check):
>>> trie.has_keys_with_prefix(u'fo')
True
You can find all prefixes of a given string that are present in the trie (this is slower; according to the tests, 500-600 thousand operations/sec):
>>> trie.prefixes(u'foobarbaz')
[u'foo', u'foobar']
>>> trie.prefix_items(u'foobarbaz')
[(u'foo', 5), (u'foobar', 10)]
You can find all elements whose keys start with a given string:
>>> trie.keys(u'fo')
[u'foo', u'foobar']
>>> trie.items(u'fo')
[(u'foo', 5), (u'foobar', 10)]
>>> trie.values(u'foob')
[10]
In the last example most of the time is currently spent building the result rather than searching; this can be optimized. Right now the speed is somewhere around 150-300 thousand values/sec (for example, with a prefix of length 7 and an average of 3 matching values, that comes to about 70 thousand operations/sec).
datrie.Trie has various other methods; help on them, along with additional benchmark results, can be found in the README in the repository.
Under PyPy everything ran on Debian (10 times slower than under CPython); under PyPy on a Mac it did not start at all (with regular Python it should work on a Mac, on Linux, and under Windows). C API extensions will always be slow under PyPy because they go through the cpyext shim, and Cython generates C API extensions. One could write a ctypes wrapper instead, but it would be slow under regular Python (and not necessarily fast under PyPy), and distributing ctypes extensions is inconvenient. The PyPy folks are now working on cffi, promising it will be fast under both CPython and PyPy. Besides, maybe Cython will someday learn to generate cffi extensions instead of C API ones. Then, perhaps, happiness will arrive :) In the meantime I do not know what to do about PyPy, and probably will not do anything; it all works under Linux one way or another, and that is fine.
During the implementation I ran into Python's slow utf_32_le codec. There is a bug report with a patch (bugs.python.org/issue15027), but the patch has not been committed yet. Initially all datrie operations were about 10 times slower, but then in one place I managed to avoid encoding strings with the standard Python utf_32_le, and things got much better. The codec is still used in a couple of "hot" places, so I think that if it is sped up, some datrie operations could become up to 2 times faster.
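For context, this codec unpacks every character into four little-endian bytes, which is roughly what happens to a key on its way into libdatrie (an illustration of the hot path, not the actual wrapper code):
>>> u'foo'.encode('utf_32_le')
'f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00'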
Iterating over the tree is also not the most efficient at the moment; that is due to peculiarities of the libdatrie interface. But the author of libdatrie is a great person and is going to fix everything, so the prospects are good.
As usual, patches, bug reports, ideas, benchmarks, pull requests, etc. are welcome!
GitHub / Bitbucket, whichever is more convenient.