Library sorting




    Take a reverse-ordered array and apply sorting to it with simple inserts .



    Look at the creaking of the next element in the right place. For it, you need to free up the insertion space, because of which you have to move all the previously inserted elements.

    And how it would be nice if there were free spaces between the elements inserted earlier! Then you would not have to drag a string of elements just for the sake of inserting one.

    In 2004, three computer science experts — Michael Bender, Martin Farah-Colton, and Miguel Mostiro — decided this way to modify the sorting with simple inserts. They proposed to form an ordered part of the array, leaving gaps between the inserted elements.
    The librarian needs the books to be arranged alphabetically on a long shelf: starting from the left of the letter “A”, the books stand flush with each other all the way to the “I”. If the library received a new book related to section "B", then to put it on the shelf in the right place, you have to move each book, starting somewhere in the middle of section "B" until the last "I". This is a sort of simple inserts. However, if the librarian reserved space in each section, it is enough for him to move only a few books to make room for new books. This is the basic principle of library sorting.

    Algorithm




    • 1. Create an empty auxiliary array, several times larger than the main one.
    • 2. For the next item, look for the insertion point in the auxiliary array.
      • 2.1. If you find a place to insert, then move the item and return to step 2.
      • 2.2. If there is no space for insertion, re-balance the auxiliary array and return to step 2.
    • 3. After processing all the elements we transfer them back to the main array.

    At first glance, it seems that sorting is easy and simple. To dispel this illusion, consider the key points of the algorithm in more detail.

    Size of auxiliary array


    While there is no established opinion, how many times the auxiliary array should be larger than the main one.

    If you take too much, then there will be a lot of space between the elements, but the search for the insertion point and rebalancing will be slower, due to the large distances between the elements. Rebalancing will happen less frequently, but will have to spend more resources on them. If you take too little, searching and rebalancing will be cheaper, but you will have to reformat the array more often. In general, there is still need to test with different values ​​and method of scientific spear to determine the best option.

    If we determine how many times the auxiliary array is larger than the main one, then the formula for determining the exact number of elements for it looks like this:

    NewSize = ε × (Size + 1) - 1

    NewSize - the number of elements in the auxiliary array
    ε - how many times the auxiliary array is larger than the main
    Size - the number of elements in the main array


    If we simply multiply the size by the factor: NewSize = Size × ε , then for uniform distribution we will not have enough cells in the number ε - 1 pieces That is, it is possible to arrange them evenly, but either the first filled cell or the last one will be right next to the edge of the auxiliary array. And we need the empty spaces of the filled cells to be reserved on all sides - including before the first element and after the last one.



    It seems to be a trifle, but in fact it is important to rebalance, in order to guarantee free places for insertion in any place, including when processing the last elements of the main array.

    Search insert location in auxiliary array


    Of course, a binary search is needed here. However, the classic implementation does not suit us.

    First, the auxiliary array basically consists of emptiness. Therefore, recursively dichotomizing the structure, we will mostly stumble upon unfilled cells. In these cases, you need to go a little left or right to the nearest non-empty cell. Then at the ends of the segment there will be significant elements that allow to calculate the arithmetic average and continue the binary depth search.

    Secondly, do not forget about the edges. If you need to insert a minimum or maximum element, then a binary search among those previously inserted will not lead to anything. Therefore, it is worthwhile to provide for boundary cases - first check whether it is necessary to put an element near the left or right border of the array and, if not, then use a binary search.

    Third, taking into account the specifics of the application, it is worth making additional amendments to minimize the number of rebalancing of the array. If the inserted element is equal to the value on one of the ends of the segment, then, perhaps, you should not insert it in the middle of the segment. It is more logical to put next to an element equal in value. This will more effectively fill the empty space of the auxiliary array.

    You can come up with the fourth, fifth, and so on. The quality of the search for the insertion point directly affects the speed of sorting, since the choice of unsuccessful places for insertion leads to unnecessary rebalancing. For example, it may be worth dividing the segments not exactly in the middle, but closer to the left or right edge of the segment, depending on which edge is closer in value to the inserted element?

    The binary search algorithm itself harbors pitfalls, and taking into account the above-mentioned additional nuances it finally becomes a completely non-trivial task.

    Array rebalancing


    Binary search is not the hardest thing to implement in this sorting. There is still rebalancing.

    When there is no space for insertion (elements close by value are found, but there are no free cells between them), you need to shake the auxiliary array so that the insertion space is free. This shaking of the array is rebalancing.

    Moreover, rebalancing is local or complete .

    Local rebalancing


    We shift as many elements as necessary to release the insertion point. The implementation of such a balancing is very simple, you just need to find out the empty cell nearest to the insertion point and use it to move several elements.

    There are possible nuances. For example, in which side to look for the nearest vacant place? To avoid a situation where a shift is not possible (that is, if in one of the sides all cells are occupied to the very edge of the array), you can focus on the position of the insertion point relative to the middle of the array. If you need to insert in the left part of the array, then move to the right, if in the right - left. If ε ≥ 2, then this approach eliminates the situation in which a shift is impossible (because half of the auxiliary array has more than enough space for all elements).

    In the author's interpretation of the algorithm, it is the local rebalancing that is meant.

    Full rebalancing


    An interesting local alternative is complete rebalancing. That is, in the auxiliary array, shift all the existing elements so that there are (almost) identical gaps between them.



    I tried both options and for the time being I observe that with local rebalancing the algorithm works 1.5-2 times faster than with full. However, from the complete rebalancing can still be a good judge. For example, if local rebalancing has to be done too often, this means that there are many “blood clots” in the array in certain areas, slowing down the whole process. A complete rebalancing, carried out one-time, allows you to get rid of all local congestion in one fell swoop.

    Let us analyze exactly how to produce a complete rebalancing.

    First you need to calculate how many cells we can allocate for each element in the auxiliary array. It should be remembered that empty cells must be both before the first and after the last filled cell. The formula is as follows:



    M - the number of cells that can be allocated to each element
    NewSize - the size of the auxiliary array
    Count - the current number of non-empty elements in the auxiliary array


    This fraction needs to be reduced to an integer value (that is, rounded down). From the formula it is obvious that the more elements already transferred to the auxiliary array, the less cells we can allocate for the neighborhood of each element.

    Knowing M, we easily get the exact position for each non-empty element in the auxiliary array, on which it should be after the rebalancing is completed.

    NewPos = Number × M

    NewPos - new element position after rebalancing
    Number - what is the non-empty element in the auxiliary array (1 ≤ Number ≤ Count)
    M - the number of cells allocated for each element


    New positions are known, can one just go through non-empty elements in the auxiliary array and transfer them to other places? Oh, no, don't hurry. It is necessary not just to transfer the elements, it is important to preserve their order. As a result of binary search and insertion, elements can appear both strongly left and strongly to the right of the position in which they should be after rebalancing. And on the place where you need to move, there may be another element that also needs to be attached somewhere. In addition, it is impossible to transfer an element if there are other elements between its old and new positions in the auxiliary array - otherwise the elements will mix, and it is extremely important for us not to confuse the order.

    Therefore, to rebalance it will not be enough to go through the usual cycle and simply shift each element from one pocket to another. We'll have to use recursion. If an element cannot be transferred to a new place (between its old and new positions there were other elements), then you first need to deal (recursively) with these uninvited guests. And then everything will be placed correctly.

    Degenerate case


    For most sorting inserts, a reverse-ordered array is the worst situation. And the sorting of the librarian, alas, is no exception.



    The elements tend to the left edge of the auxiliary array, with the result that empty spaces quickly end. It is necessary to rebalance an array very often.

    By the way, if we take an almost ordered array (the best case for sorting with simple inserts), we get the same problem. The newly arriving elements will hammer not the left, but the right side of the auxiliary array, which will also lead to too frequent rebalances.

    Library sort effectively handles random data sets. Partial ordering (both direct and inverse) degrades speed performance.

    Algorithmic complexity


    On large sets of random data, the algorithm gives the time complexity O ( n log n ). Not bad at all!

    On sets of random unique (or mostly unique) data, with proper selection of the coefficient ε and successful implementation of the binary search, the number of rebalances can be minimized, or even avoided. It can be argued that the algorithm has the best time complexity O ( n ).

    A large percentage of recurring data, as well as the presence of ordered (in direct or reverse order) subsequences in the array, leads to frequent rebalancing of the auxiliary array and, as a result, to the degradation of the time complexity to O (n 2) in the most adverse cases.

    The minus of the algorithm, of course, is that O ( n ) additional memory is required for the auxiliary array .

    Possible ways to improve


    Although the algorithm itself is instructive and effective on random data, for a decade and a half, very few people showed interest in it.

    If you google at the request of the “library sort”, you will find a cursory article in the English Wikipedia, the author's PDF (from which little is clear) and the rare reposting of this scant information. Plus there is a good visualization in Youtube, where the main and auxiliary arrays are originally combined. All links are at the end of the article.

    Searching for the query “library sorting” is even more fun — in the sample you will find out by different sortings from different libraries, but these algorithms will have no relation to authentic library sorting .

    And there is something to improve:

    1. Empirical selection of the optimal coefficient ε .
    2. Modification (taking into account the specifics of the general algorithm) of the binary search for the most efficient determination of the insertion point.
    3. Minimize rebalancing costs.

    If you polish these places, then maybe the library sort will even be equal to quick sort in speed?

    Source


    I did not have time to prepare the implementation in Python, there is a working version in PHP.

    Main algorithm
    functionLibrarySort($arr){
    	global $arr_new;//Вспомогательный массив
    	$e = 3;//Во сколько раз увеличиваем массив
    	$rebalance_local = true;//Локальная (true) или полная (false) перебалансировка//Создаём вспомогательный массив
    	$newSize = $e * (count($arr) + 1) - 1;
    	$arr_new = array_fill(0, $newSize, null);
    	//Первый элемент переносим в середину вспомогательного массива
    	$arr_new[LibrarySort_Pos(1, 1, $newSize)] = $arr[0];
    	//Начиная со второго элемента - ищем место вставки //и переносим во вспомогательный массив
    	$start = 0; $finish = $newSize - 1;	$i = 1;
    	//Перебираем по порядку элементы основного массиваwhile($i < count($arr)) {
    		//Бинарным поиском ищем место вставки в дополнительном массиве
            $pos = LibrarySort_BinarySearch($arr[$i], $start, $finish, $newSize);
            if($pos === false) {//Точка вставки не обнаружена ни в каком виде//Полная перебалансировка вспомогательного массива
    			LibrarySort_Rebalance_Total($i, $newSize);
    		} else {//Точки вставки нашлась, но надо проверить, свободна ли онаif($arr_new[$pos] !== null) {//Точка вставки занятаif($rebalance_local) {//Локальная перебалансировка (+ вставка)
    					LibrarySort_Rebalance_Local($arr[$i++], $pos, $newSize);
    				} else {//Полная перебалансировка
    					LibrarySort_Rebalance_Total($i, $newSize);
    				}
    			} else {//Нашли свободную точку вставки, просто вставляем
    				$arr_new[$pos] = $arr[$i++];
    			}
    		}
    	}
    	//Переносим из дополнительного массива в основной
        $pos = 0;
        for($i = 0; $i <= $newSize - 1; $i++) {
            if($arr_new[$i] !== null) $arr[$pos++] = $arr_new[$i];
        }	
    	return $arr;
    }

    New element position in additional array after complete rebalancing
    //Позиция $number-го элемента из имеющихся $count//во вспомогательном массиве после перебалансировки//$number - какой по счёту в массиве (с начала) элемент//$count - сколько на данный момент перенесено элементов//$newSize - полный размер вспомогательного массива//$number <= $count <= count($arr) <= $newSize)functionLibrarySort_Pos($number, $count, $newSize){
    	return $number * floor(($newSize + 1) / ($count + 1)) - 1;
    }

    Binary search for insertion point in auxiliary array
    //Бинарный поиск места вставки во вспомогательном массиве//$search - значение элемента из основного массива, который нужно перенести во вспомогательный//($start, $finish) - крайние индексы диапазона, в котором происходит поиск//$newSize - полный размер вспомогательного массиваfunctionLibrarySort_BinarySearch($search, $start, $finish, $newSize){
    	global $arr_new;//Вспомогательный массив//Сначала разберёмся с левой границей диапазона//На левой границе диапазона должен находиться элемент, а не пустая ячейка//Сужаем слева диапазон, если на левом краю пустые элементыwhile($arr_new[$start] === null && $start < $newSize - 1) {
            ++$start;
        }
        //Если на границе такого же значения элемент что и искомый,//то нужно вставить сразу за ним или перед нимif($search == $arr_new[$start]) {
            return LibrarySort_PosNearby($start, $newSize);
        //Может встретиться случай, когда искомый элемент меньше чем левая граница
        } elseif($search < $arr_new[$start]) {
            //Тогда искомому элементу место между левой //границей и началом дополнительного массиваif($start > 0) {//Перед $start есть свободное место
    			$finish = $start;
    			$start = 0;
    			return floor(($start + $finish) / 2);
    		} else {//$start == 0, перед ним ничего не вставитьreturn $start;//Если не нашли пустую ячейку, возвращаем тогда непустую
    		}
        }
        //На правой границе диапазона должен находиться элемент, а не пустая ячейка//Сужаем справа диапазон, если на левом краю пустые элементыwhile($arr_new[$finish] === null && $finish > 0) {
            --$finish;
        }
        //Если на границе такого же значения элемент что и искомый,//то нужно вставить сразу за ним или перед нимif($search == $arr_new[$finish]) {
            return LibrarySort_PosNearby($finish, $newSize);
        //Может встретиться случай, когда искомый элемент больше чем правая граница
        } elseif($search > $arr_new[$finish]) {
            //Тогда искомому элементу место между правой //границей и концом дополнительного массиваif($finish < $newSize - 1) {//После $finish есть свободное место
    			$start = $finish;
    			$finish = $newSize - 1;
    			return ceil(($start + $finish) / 2);
    		} else {//$finish == $newSize - 1, после него ничего не вставитьreturn $finish;//Если не нашли пустую ячейку, возвращаем тогда непустую
    		}
        }
    	//Искомый элемент не совпадает с краями, //значит, его нужно вставить где-то внутри диапазона//При этом нужно проверить, есть ли внутри диапазона ещё элементыIf($finish - $start > 1) {//Делить диапазон имеет смысл, если там минимум 3 элемента
    		$middle = ceil(($start + $finish) / 2); //Индекс середины диапазона
            $middle_Pos = 0; //Непустую "середину" диапазона пока не нашли
            $offset = 0; //Будем сдивгаться от точной середины в поисках непустого элемента//Итак, сдигаемся от середины влево/вправо, пока не обнаружим непустой элементwhile($middle - $offset > $start && $middle_Pos == 0){
                if($arr_new[$middle - $offset] !== null) {
                    $middle_Pos = $middle - $offset;
                } elseif($middle + $offset < $finish &&  $arr_new[$middle + $offset] !== null) {
                    $middle_Pos = $middle + $offset;
                }
                ++$offset;
            }
            //Если внутри нашли непустой элемент, то, в зависимости от его значения, //или ищем место возле него или рекурсивно применяем к левой или правой части диапазонаIf($middle_Pos) {
                if($arr_new[$middle_Pos] == $search) {
    				return LibrarySort_PosNearby($middle_Pos, $newSize);
                } else {
    				if($arr_new[$middle_Pos] > $search) {
    					$finish = $middle_Pos;
    				} else {//$arr_new[$middle_Pos] < $search
    					$start = $middle_Pos;
    				}
                    return LibrarySort_BinarySearch($search, $start, $finish, $newSize);
                }
            } else {//$middle_Pos == 0 - внутри диапазона (не считая его границ) не найдены непустые элементыreturn $middle;//Середина такого диапазона - место, куда можно вставить элемент
            }
        } else {//Нашли минимальный отрезок, но свободного места в нём нетreturn floor(($start + $finish) / 2);
    	}
    	returnfalse;//Если сюда добрались, значит бинарный поиск ничего не дал
    }

    If the search element is equal to one of the ends of the segment
    //Если элемент равен концу отрезка, то ставим его рядом слева или справа//$start - позиция, на которой уже стоит элемент равный искомому//$newSize - полный размер вспомогательного массиваfunctionLibrarySort_PosNearby($start, $newSize){
    	global $arr_new;//Вспомогательный массив//Сначала попытаемся найти свободное место перед элементомfor($left = $start - 1; $left >= 0; $left--) {
            if($arr_new[$left] === null) {//Пустая ячейкаreturn $left;//Сюда и вставим
            } elseif($arr_new[$left] <> $arr_new[$start]) {//Встретился элемент с другим значениемbreak; //Слева ничего не поставишь, можно дальше влево не искать
            }
        }
    	//Если не нашли свободного места слева, попробуем поискать справаfor($right = $start + 1; $right <= $newSize - 1; $right++) {
            if($arr_new[$right] === null) {//Пустая ячейкаreturn $right; //Сюда и вставим
            } elseif($arr_new[$right] <> $arr_new[$start]) {//Встретился элемент с другим значениемbreak; //Справа ничего не поставишь, можно дальше вправо не искать
    		}
        }
    	return $start; //Ни слева ни справа от конца отрезка не нашли места вставки. Но мы возвращаем хотя бы точку, с которой начали поиск
    }

    Local rebalancing of additional array
    //Местная перебалансировка вспомогательного массива//$insert - значение, которое надо вставить//$pos - начиная с какого элемента нужно сдвинуть несколько штук влево или вправо//$newSize - полный размер вспомогательного массиваfunctionLibrarySort_Rebalance_Local($insert, $pos, $newSize){
    	global $arr_new;//Вспомогательный массива//Уточняем $pos для $insert, иногда надо чуть левее или правееwhile($pos - 1 >= 0 && $arr_new[$pos - 1] !== null && $arr_new[$pos - 1] > $insert) {--$pos;}
    	while($pos + 1 <= $newSize - 1 && $arr_new[$pos + 1] !== null && $arr_new[$pos + 1] < $insert) {++$pos;}
    	$middle = (integer) $newSize / 2;//Середина массиваif($pos <= $middle) {//Точка вставки в левой части массиваif($arr_new[$pos] !== null && $arr_new[$pos] < $insert) ++$pos;
    		//Сдвигаем вправо
    		$right = $pos;
    		while($arr_new[++$right] !== null) {}
    		for($i = $right; $i > $pos; $i--) {
    			$arr_new[$i] = $arr_new[$i - 1];
    		}
    	} else {//Точка вставки в правой части массива	if($arr_new[$pos] !== null && $insert < $arr_new[$pos]) --$pos;
    		//Сдвигаем влево
    		$left = $pos;
    		while($arr_new[--$left] !== null) {}
    		for($i = $left; $i < $pos; $i++) {
    			$arr_new[$i] = $arr_new[$i + 1];
    		}		
    	}
    	$arr_new[$pos] = $insert;
    }

    Full rebalancing of an additional array
    //Полная перебалансировка вспомогательного массива//$count - сколько элементов на данный момент в массиве//$newSize - полный размер вспомогательного массиваfunctionLibrarySort_Rebalance_Total($count, $newSize){
    	global $arr_new;//Вспомогательный массивglobal $library_Number;//Какой по счёту элемент переносимglobal $library_LeftPos;//Самая левая позиция при которой влево переносили элементы
        $library_Number = $count; //Начинаем переносить последние с конца по счёту элементы
        $library_LeftPos = $newSize - 1;//Пока неизвестна, поэтому максимально правое значение//Проходим элементы в дополнительном массиве от последнего к началу
        $i = $newSize - 1;
        while($i >= 0) {
            if($arr_new[$i] !== null) {//Очередной непустой элемент
    			$pos = LibrarySort_Pos($library_Number, $count, $newSize);//Его надо перенести сюдаnewSize=$newSize");if($i == $pos) {//Элемент уже стоит на своём месте
                    --$i;//Переходим дальше влево по дополнительному массиву
    			} elseif($i < $pos) {//Элемент нужно перенести вправо
    				$arr_new[$pos] = $arr_new[$i];
    				$arr_new[$i] = null;
    				--$i;//Переходим дальше влево по дополнительному массиву
                } else {//$i > $pos - элемент нужно перенести влево//Рекурсивная процедура для переноса элемента влево
                    LibrarySort_RemoveLeft($i, $pos, $count, $newSize);
    				$i = ($i > $library_LeftPos) ? $library_LeftPos - 1 : --$i;
    			}
    			--$library_Number;//Переносимых элементов стало на одного меньше
    		} else {//Пустая клетка, нечего переносить
    			--$i;//Переходим дальше влево по дополнительному массиву
    		}
        }    
    }

    Moving an item to the left with full rebalancing
    //Перенос элемента левее в перебалансируемом массиве. //Этому могут мешать другие элементы, находящиеся слева//$i - в какой ячейке находится элемент, который нужно перенести//$pos - в какую ячейку нужно перенести элемент//$count - сколько всего сейчас элементов во вспомогательном массиве//$newSize - полный размер вспомогательного массиваfunctionLibrarySort_RemoveLeft($i, $pos, $count, $newSize){
    	global $arr_new;//Вспомогательный массивglobal $library_Number;//Какой по счёту элемент переносимglobal $library_LeftPos;//Самая левая позиция при которой влево переносили элементы
        $left = false; $left_Pos = false;//Пока ближайший сосед слева не найден
        $j = $i;//Начинаем перебирать сразу от переносимого элемента//Нужно выяснить ближайшего соседа слеваwhile($j > 0 && $left === false) {//Пока в пределах вспомогательного массива и пока не найден ближайший сосед слева
            --$j; //Продолжаем влево поиск ближайшего соседаif($arr_new[$j] !== null) $left = $j;//Ближайший сосед слева обнаружен
        }
        if($left === false || $left < $pos) {//Ближайший сосед слева (если он найден) не мешает переносу//Ничего дополнительно предпринимать не нужно
        } else { //$left >= $pos, сосед слева мешает переносу
            --$library_Number;//Значит, левее нужно перенести мешающего соседа слева
            $left_Pos = LibrarySort_Pos($library_Number, $count, $newSize);//Соседа слева надо перенести сюда//Рекурсивно переносим соседа слева на его правильную позицию
            LibrarySort_RemoveLeft($left, $left_Pos, $count, $newSize);
            //Сосед слева отодвинут, теперь можно перенести элемент
        }
    	//С соседом слева всё нормально, переносим элемент
    	$arr_new[$pos] = $arr_new[$i];
    	$arr_new[$i] = null;
        //Уточняем, был ли это самый влево перенесённый из соседейif($pos < $library_LeftPos) $library_LeftPos = $pos;
    }

    I had to code from scratch myself, based on the general description of the method. I did not see the speed close to quick sorting; my library sort variant sorts 10-20 times slower than quick sort. But the reason, of course, is that my implementation is too raw, much has not been taken into account.

    I would like to see the version from the creators of the algorithm. I will write to the authors today (and throw off a link to this link), they will suddenly respond. Although ... I remember I tried to contact Allen Bichik ( ABC-sorting ) and Jason Morrison ( J-sorting ), but the result was the same as if I had written to Sportloto.

    UPD. Martin Farah-Colton replied that they never did the implementation:
    I’m never accepted that algorithms.
    The main thing - the idea :)

    Algorithm Characteristics

    TitleLibrary Sort, Librarian Sort, Library sort
    Other nameGapped Insertion sort Sorting Inserts
    The authorsMichael A. Bender, Martin Farach-Colton, Miguel Mosteiro
    Year2004
    ClassSorting inserts
    Comparisonsthere is
    Time complexitythe bestO ( n )
    averageO ( n log n )
    the worstO ( n 2 )
    Extra memory difficultyO ( n )

    Links


    Library sort

    Library Sort algorithm visualization

    Insertion sort is O (n log n)

    The authors of the algorithm:



    Michael A. Bender
    Martin Farah-Colton
    Miguel Mosteiro



    Articles series:





    Sorting added to AlgoLab. So you can experiment with small data sets.

    You can decide for yourself how many times the auxiliary array is larger than the main one. To select the coefficient ε, you can right-click on the cell with the “Library sorting” and select the item “Change note”. And in a note, carefully put an integer value for e from 2 to 5. If you enter something else instead of these numbers, the default value = 2 will be used.

    You can also choose the type of rebalancing. If you set local = 1, then local rebalancing will be used. If local = 0, complete.

    And do not forget to set the optimal scale for the process sheet before starting the visualization, otherwise the auxiliary array will not fit on the screen.


    Also popular now: