
Solving the problem of sorting trees using a binary Materialized Path
Faced with the problem of implementing tree-like comments on one project, in this article I post my solution.
In principle, the task is classical, and at first I solved it by combining the Adjacency List and Materialized Path . In other words, columns are added to the table
In mpath, the full path of the branch was stored in the tree, for example / 1/3/4/5/8/9 , where the comment ID was listed through the "/" sign.
With this approach, it is very easy to add new comments of almost any level of nesting. At the same time, the selection was made by the query:
The problem arose with an increase in the number of comments. For example, there is a tree with the following paths:
Here, the numbers indicate the ID, and the order is what it would be when used
Comment 1.2.10 goes to 1.2.3 and others, although it was added later (judging by the ID).
The solution to the problem was partially described here: http://habrahabr.ru/post/125729/ . The idea is to use not decimal IDs in the path, but to encode to another number system in order to have less restrictions on the path length. At the same time, separators in the form of dots or other characters are not needed, since the field will be used only for sorting.
I used base 95, which is exactly the number of characters to print in ASCII.
If for each value in the way we use a fixed length, we get the following upper threshold ID:
1 - 95 ^ 1 - 1 = 94
2 - 95 ^ 2 - 1 = 9 024
3 - 95 ^ 3 - 1 = 857 374
4 - 95 ^ 4 - 1 = 81 450 624
5 - 95 ^ 5 - 1 = 7 737 809 374
5 characters are enough to not worry about the maximum number of comments.
The maximum nesting level, at the same time, for VARCHAR 255/5 = 51
Theoretically, you can take not BASE95, but BASE256 (along with non-printable characters), but store it all binary in BLOB, though here I am not sure about the performance when sorting (I need to check ) Then with 3 characters, we could encode 16,777,215 variants, and 4 - 4,294,967,295.
I will give my example of the implementation of this whole theory.
And further selection:
The dec2base function is taken from here .
Such a solution, in my opinion, is very simple. If you also have beautiful solutions in the arsenal, share them in the comments.
Task description
- Storage of hierarchical data (tree comments) in a MySQL relational database
- Simple addition of nodes / branches
- Selection of the whole tree in one query, with branches sorted in the right order
In principle, the task is classical, and at first I solved it by combining the Adjacency List and Materialized Path . In other words, columns are added to the table
id INT(11)
parentid INT(11)
mpath VARCHAR(255)
In mpath, the full path of the branch was stored in the tree, for example / 1/3/4/5/8/9 , where the comment ID was listed through the "/" sign.
With this approach, it is very easy to add new comments of almost any level of nesting. At the same time, the selection was made by the query:
SELECT *
FROM messages
WHERE postid=$postid
ORDER BY mpath ASC
The problem arose with an increase in the number of comments. For example, there is a tree with the following paths:
1
1.2
1.2.10
1.2.3
1.2.7
4
4.8
4.8.9
5
5.6
Here, the numbers indicate the ID, and the order is what it would be when used
ORDER BY mpath ASC
. Comment 1.2.10 goes to 1.2.3 and others, although it was added later (judging by the ID).
The solution of the problem
The solution to the problem was partially described here: http://habrahabr.ru/post/125729/ . The idea is to use not decimal IDs in the path, but to encode to another number system in order to have less restrictions on the path length. At the same time, separators in the form of dots or other characters are not needed, since the field will be used only for sorting.
I used base 95, which is exactly the number of characters to print in ASCII.
!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
If for each value in the way we use a fixed length, we get the following upper threshold ID:
1 - 95 ^ 1 - 1 = 94
2 - 95 ^ 2 - 1 = 9 024
3 - 95 ^ 3 - 1 = 857 374
4 - 95 ^ 4 - 1 = 81 450 624
5 - 95 ^ 5 - 1 = 7 737 809 374
5 characters are enough to not worry about the maximum number of comments.
The maximum nesting level, at the same time, for VARCHAR 255/5 = 51
Theoretically, you can take not BASE95, but BASE256 (along with non-printable characters), but store it all binary in BLOB, though here I am not sure about the performance when sorting (I need to check ) Then with 3 characters, we could encode 16,777,215 variants, and 4 - 4,294,967,295.
How it looks in code
I will give my example of the implementation of this whole theory.
// $mpath - хранит стандартный materialized path с разделителем "/"
// При добавлении нового комментария:
$db->Execute("
UPDATE messages SET
mpath=?,
bpath=?,
depth=?
WHERE id=?
", array(
$mpath,
bpath($mpath),
$depth,
$messid
));
// . . .
// константа с символами для кодирования
define('BASE95', ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~');
// . . .
// функция bpath
function bpath($mpath, $sep = '/', $max_len = 5) {
$bpath = '';
$chunks = explode($sep, trim($mpath, $sep));
$zero = substr(BASE95, 0, 1);
foreach($chunks as $chunk)
$bpath .= str_pad(dec2base($chunk, 95, BASE95), $max_len, $zero, STR_PAD_LEFT);
return $bpath;
}
And further selection:
SELECT *
FROM messages
WHERE postid=$postid
ORDER BY bpath ASC
The dec2base function is taken from here .
Such a solution, in my opinion, is very simple. If you also have beautiful solutions in the arsenal, share them in the comments.