Solving the problem of sorting trees using a binary Materialized Path

    Faced with the problem of implementing tree-like comments on one project, in this article I post my solution.

    Task description


    • Storage of hierarchical data (tree comments) in a MySQL relational database
    • Simple addition of nodes / branches
    • Selection of the whole tree in one query, with branches sorted in the right order

    In principle, the task is classical, and at first I solved it by combining the Adjacency List and Materialized Path . In other words, columns are added to the table

    id INT(11)
    parentid INT(11)
    mpath VARCHAR(255)

    In mpath, the full path of the branch was stored in the tree, for example / 1/3/4/5/8/9 , where the comment ID was listed through the "/" sign.

    With this approach, it is very easy to add new comments of almost any level of nesting. At the same time, the selection was made by the query:

    SELECT *
    FROM messages
    WHERE postid=$postid
    ORDER BY mpath ASC

    The problem arose with an increase in the number of comments. For example, there is a tree with the following paths:

    1
    1.2
    1.2.10
    1.2.3
    1.2.7
    4
    4.8
    4.8.9
    5
    5.6

    Here, the numbers indicate the ID, and the order is what it would be when used ORDER BY mpath ASC.
    Comment 1.2.10 goes to 1.2.3 and others, although it was added later (judging by the ID).

    The solution of the problem


    The solution to the problem was partially described here: http://habrahabr.ru/post/125729/ . The idea is to use not decimal IDs in the path, but to encode to another number system in order to have less restrictions on the path length. At the same time, separators in the form of dots or other characters are not needed, since the field will be used only for sorting.

    I used base 95, which is exactly the number of characters to print in ASCII.

     !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    


    If for each value in the way we use a fixed length, we get the following upper threshold ID:
    1 - 95 ^ 1 - 1 = 94
    2 - 95 ^ 2 - 1 = 9 024
    3 - 95 ^ 3 - 1 = 857 374
    4 - 95 ^ 4 - 1 = 81 450 624
    5 - 95 ^ 5 - 1 = 7 737 809 374

    5 characters are enough to not worry about the maximum number of comments.
    The maximum nesting level, at the same time, for VARCHAR 255/5 = 51

    Theoretically, you can take not BASE95, but BASE256 (along with non-printable characters), but store it all binary in BLOB, though here I am not sure about the performance when sorting (I need to check ) Then with 3 characters, we could encode 16,777,215 variants, and 4 - 4,294,967,295.

    How it looks in code


    I will give my example of the implementation of this whole theory.

    // $mpath - хранит стандартный materialized path с разделителем "/"
    // При добавлении нового комментария:
    $db->Execute("
    	UPDATE messages SET 
    	mpath=?, 
    	bpath=?,
    	depth=? 
    	WHERE id=?
    ", array(
    	$mpath, 
    	bpath($mpath),
    	$depth, 
    	$messid
    ));
    // . . . 
    // константа с символами для кодирования
    define('BASE95', ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~');
    // . . . 
    // функция bpath
    function bpath($mpath, $sep = '/', $max_len = 5) {
    	$bpath = '';
    	$chunks = explode($sep, trim($mpath, $sep));
    	$zero = substr(BASE95, 0, 1);
    	foreach($chunks as $chunk) 
    		$bpath .= str_pad(dec2base($chunk, 95, BASE95), $max_len, $zero, STR_PAD_LEFT);
    	return $bpath;
    }
    

    And further selection:

    SELECT *
    FROM messages
    WHERE postid=$postid
    ORDER BY bpath ASC

    The dec2base function is taken from here .

    Such a solution, in my opinion, is very simple. If you also have beautiful solutions in the arsenal, share them in the comments.

    Also popular now: