Sorting with a ... hash table (and also counting sort with a tree and with a HashMap)

Three days ago I was thinking about combining counting sort with a tree. After discussing it with a colleague, we came up with the following solution: instead of a TreeSet, use a HashMap (how a TreeSet comes into it is explained below). That still seemed too little to me, so I decided to implement my own hash table and see what would come of it. The results struck me as quite interesting.

All three kinds of sorting work well when the cardinality of the set of unique elements in the array is relatively small (what exactly is meant by that will become clear once we look at the test results).

Counting sort with a tree


We build a tree of pairs (Key, Count), where Key is an element of the array and Count is the number of times that element occurs. The tree is, of course, balanced; a red-black tree, for example.

The rest is straightforward. For every element of the array we search the tree for a matching pair (to avoid creating objects over and over, we reuse one previously created pair and only change its Key; the Count does not matter here, since lookup is done by Key alone). If such a Key is already present, we increment its Count; otherwise we add a new pair (Key, 1).

Then we rewrite the array: we repeatedly remove the smallest node and write out its Key as many times as its Count says.

To avoid implementing the tree itself, I used a TreeSet, which is backed by a red-black tree, for the algorithm described above.
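For reference, the same idea can be expressed very compactly with a TreeMap; this is only an equivalent sketch of mine (java.util imports assumed), not the author's code, which uses a TreeSet of pairs and is listed below.

// Counting sort backed by a balanced tree: count every value, then write the
// keys back in ascending order, each repeated as many times as it occurred.
static void treeCountingSort(Integer[] arr) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (Integer v : arr)
            counts.merge(v, 1, Integer::sum);      // increment the counter for v
        int ptr = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet())
            for (int i = 0; i < e.getValue(); i++)
                arr[ptr++] = e.getKey();
    }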

Using HashMap


The array element serves as the key, and the value stored under that key is the number of its repetitions.

Using my own hash table


I decided to go with open addressing. The table stores objects of the MyPair class, where first is the key and second is the number of repetitions. Naturally, I added a constructor that takes the initial table size and the load factor. The hash function is as simple as it gets: the key modulo the table length, and on a collision we step forward by one cell (a step of one works well because the resulting array of keys comes out almost ordered, and at the end we still have to sort all the keys with an ordinary quicksort, since the table may leave them slightly out of order). For example, with a table of length 20, key 7 lands in cell 7, while key 27 collides with it and is placed in cell 8, so the keys stay almost in ascending order. So you could say this is a union of hashing and one more sort.

Collisions


Obviously, collisions can occur, and they hurt both the time spent finding a slot and the speed of sorting the final array of keys (ordered or nearly ordered data sorts faster). But here is the thing: our hash function changes as the table grows, so deliberately crafting data against it becomes much harder. On top of that, you can randomize (within a certain range, of course) the load factor and the growth factor, which makes it impossible to pre-select values that provoke collisions. And even if some data caused collisions in one table, after rehashing the number of collisions can drop significantly (several times over). Factorials will remain a weak point, but there are remarkably few of them (up to 2^31 there are only 11, not counting 0).
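As a minimal sketch of that randomization idea (my own illustration, not part of the original code): pick the load factor and growth factor from fixed ranges when the table is created, using the constructors shown at the end of the article. The ranges and the helper name are assumptions.

// Build a table whose load factor and growth factor are chosen at random within
// fixed ranges, so inputs cannot be tuned against exact resize points in advance.
static MyHashTable newRandomizedTable(int capacity) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        double loadFactor  = rnd.nextDouble(0.5, 0.9);   // load factor somewhere in [0.5, 0.9)
        double expansivity = rnd.nextDouble(1.5, 3.0);   // growth factor somewhere in [1.5, 3.0)
        return new MyHashTable(capacity, loadFactor, expansivity);
    }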

We will not take a cryptographic hash function, since with one of those you can forget about a nearly ordered array of hashes (in the sense of being ordered by key).

Running time


Counting sort with a tree: O(n log n) in the worst case, linear time in the best case.
Sorting with the hash table: the time spent hashing plus the complexity of sorting the array of keys (which depends on the chosen algorithm and implementation, so I will not pin down a figure here). The overall complexity is hard to estimate because of the hash function and the different possible approaches to sorting the keys; this needs further investigation. It is clear, though, that in the worst case (when the input is deliberately crafted against the specific hash function at a specific stage of execution) the running time is O(n^2).

How much memory?


Counting sort with a tree needs O(distinct(n)) additional memory.
The hash table needs O(distinct(n)) memory plus whatever the key sort requires (depending on the chosen algorithm).

Here are the results, in milliseconds, that I got sorting an array of 10 million Integer objects filled with random values from the range [0; x]:

In the tests, the load factor of my hash table is 0.75, the initial capacity is 20, and the table doubles in size each time it grows.
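The article does not show the benchmark harness itself; below is a rough sketch of how I assume such a measurement looks for one value of x, with the sort methods listed later in the article assumed to be in the same class. The class name SortBenchmark and the particular x are mine.

import java.util.Arrays;
import java.util.Random;

public class SortBenchmark {
    public static void main(String[] args) {
        int n = 10_000_000;
        int x = 10_000;                                   // upper bound of the value range
        Random rnd = new Random();
        Integer[] data = new Integer[n];
        for (int i = 0; i < n; i++)
            data[i] = rnd.nextInt(x + 1);                 // random value in [0; x]

        Integer[] copy = Arrays.copyOf(data, n);
        long start = System.currentTimeMillis();
        Arrays.sort(copy);                                // built-in sort
        System.out.println("built-in: " + (System.currentTimeMillis() - start) + " ms");

        copy = Arrays.copyOf(data, n);
        start = System.currentTimeMillis();
        usatovSort(copy);                                 // counting-tree sort, listed below
        System.out.println("counting-tree: " + (System.currentTimeMillis() - start) + " ms");
    }
}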

At x = 10:
2044 - built-in
285 - by counting-tree (Usatov sort)
276 - HashMap (Usatov-Prokurat sort)
140 - by my hash table (Usatov-Prokurat sort using MyHashTable)

x = 100:
2406 - built-in
455 - by counting-tree
283 - HashMap
134 - hash table

x = 1_000:
2171 - built-in
930 - counting-tree
380 - HashMap
209 - hash table

x = 10_000
2879 - built-in
1666 - counting-tree
634 - HashMap
326 - hash table

x = 100_000
4045 - built-in
2899 - counting-tree
866 - HashMap
762 - hash table

x = 1_000_000
4997 - built-in
5762 - counting-tree
2505 - HashMap
1294 - hash table

x = 10_000_000
5083 - built-in
11480 - counting-tree
5099 - HashMap
3240 - hash table

As I said at the beginning, these sorts work best when the number of distinct elements in the array is small relative to the array's size.

It is also worth noting that the built-in sort is much faster on ordered (or nearly ordered) data, and on arrays in reverse order, whereas my approach does better on random numbers.

Counting sort with a tree:

static void usatovSort(Integer[] arr) {
        TreeSet<MyPair> tree = new TreeSet<>();
        MyPair temp;
        MyPair mp = new MyPair();                // reusable probe object for lookups
        for (int i : arr) {
            temp = mp;
            temp.first = i;
            temp = tree.ceiling(temp);
            if (temp != null && temp.first == i) // the order of the conditions matters: if the first check fails, the second is not evaluated
                temp.second++;
            else tree.add(new MyPair(i, 1));
        }
        int ptr = 0;
        while (!tree.isEmpty()) {
            temp = tree.pollFirst();
            for (int i = 0; i < temp.second; i++)
                arr[ptr++] = temp.first;
        }
    }
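A quick sanity check of the method above (my own example, assuming usatovSort and MyPair are in the same class and java.util.Arrays is imported):

public static void main(String[] args) {
        Integer[] a = {3, 1, 3, 2, 3, 1};
        usatovSort(a);
        System.out.println(Arrays.toString(a));  // prints [1, 1, 2, 3, 3, 3]
    }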

Sorting with HashMap:

static void usatovProkuratSortUsingHashMap(Integer[] arr) {
        HashMap<Integer, Integer> hm = new HashMap<>();
        Integer temp;
        for (Integer i : arr) {
            temp = hm.get(i);
            if (temp == null) hm.put(i, 1);
            else hm.put(i, temp + 1);
        }
        ArrayList<Integer> keys = new ArrayList<>(hm.keySet().size());
        keys.addAll(hm.keySet());
        keys.sort(Comparator.naturalOrder());
        int ptr = 0;
        for (Integer i : keys)
            for (int j = 0; j < hm.get(i); j++)
                arr[ptr++] = i;
    }
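As a side note, on Java 8+ the counting loop can be written with merge(), which removes the explicit null check; the sketch below is my equivalent rewrite of the same method, not the author's code (same java.util imports assumed).

static void usatovProkuratSortUsingHashMapMerge(Integer[] arr) {
        HashMap<Integer, Integer> hm = new HashMap<>();
        for (Integer i : arr)
            hm.merge(i, 1, Integer::sum);        // increment the counter for i, starting at 1
        ArrayList<Integer> keys = new ArrayList<>(hm.keySet());
        keys.sort(Comparator.naturalOrder());
        int ptr = 0;
        for (Integer key : keys) {
            int count = hm.get(key);
            for (int j = 0; j < count; j++)
                arr[ptr++] = key;
        }
    }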

Sorting with my hash table:

static void usatovProkuratSortUsingMyHashTable(Integer[] arr) {
        MyHashTable mht = new MyHashTable();
        for (Integer i : arr)
            mht.add(i);
        MyPair[] res = mht.getPairs();
        int ptr = 0;
        Arrays.sort(res, Comparator.comparingInt(o -> o.first));
        for (MyPair mp : res)
            for (int i = 0; i < mp.second; i++)
                arr[ptr++] = mp.first;
    }

Hash table implementation:

public class MyHashTable {
    private MyPair[] hashArr;
    private double count = 0;                        // number of occupied cells (distinct keys)
    private double loadFactor = 0.5;
    private double expansivity = 2;                  // growth factor used on resize
    private static final int DEFAULT_CAPACITY = 20;

    public MyHashTable() {
        hashArr = new MyPair[DEFAULT_CAPACITY];
    }

    public MyHashTable(double loadFactor) {
        hashArr = new MyPair[DEFAULT_CAPACITY];
        this.loadFactor = loadFactor;
    }

    public MyHashTable(int capacity) {
        hashArr = new MyPair[capacity];
    }

    public MyHashTable(int capacity, double loadFactor) {
        hashArr = new MyPair[capacity];
        this.loadFactor = loadFactor;
    }

    public MyHashTable(int capacity, double loadFactor, double expansivity) {
        hashArr = new MyPair[capacity];
        this.loadFactor = loadFactor;
        this.expansivity = expansivity;
    }

    public MyPair[] getPairs() {
        MyPair[] pairs = new MyPair[(int) count];
        int ptr = 0;
        for (MyPair i : hashArr)
            if (i != null)
                pairs[ptr++] = i;
        return pairs;
    }

    public MyPair get(int key) {
        int add = 0;
        while (add < hashArr.length) {
            MyPair cell = hashArr[(key + add) % hashArr.length];
            if (cell == null) return null;           // empty cell: the key is not in the table
            if (cell.first == key) return cell;
            add++;                                   // linear probing: try the next cell
        }
        return null;
    }

    public void add(int key) {
        if (count / hashArr.length >= loadFactor) grow();
        int add = 0;
        while (true) {
            if (hashArr[(key + add) % hashArr.length] == null) {
                hashArr[(key + add) % hashArr.length] = new MyPair(key, 1);
                count++;
                return;
            }
            if (hashArr[(key + add) % hashArr.length].first == key) {
                hashArr[(key + add) % hashArr.length].second++;   // key already present: bump its counter
                return;
            }
            add++;
        }
    }

    public void add(MyPair newMP) {
        if (count / hashArr.length >= loadFactor) grow();
        int add = 0;
        while (true) {
            if (hashArr[(newMP.first + add) % hashArr.length] == null) {
                hashArr[(newMP.first + add) % hashArr.length] = newMP;
                count++;
                return;
            }
            if (hashArr[(newMP.first + add) % hashArr.length].first == newMP.first) {
                hashArr[(newMP.first + add) % hashArr.length].second += newMP.second;
                return;
            }
            add++;
        }
    }

    private void grow() {
        MyPair[] oldHash = hashArr;
        hashArr = new MyPair[(int) (expansivity * hashArr.length)];
        for (MyPair i : oldHash)
            if (i != null)
                innerAdd(i);                         // rehash existing pairs into the larger array
    }

    private void innerAdd(MyPair mp) {
        int add = 0;
        while (true) {
            if (hashArr[(mp.first + add) % hashArr.length] == null) {
                hashArr[(mp.first + add) % hashArr.length] = mp;
                return;
            }
            if (hashArr[(mp.first + add) % hashArr.length].first == mp.first) {
                hashArr[(mp.first + add) % hashArr.length].second += mp.second;
                return;
            }
            add++;
        }
    }
}
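A small standalone demonstration of the table (my own example): count duplicate values and read the pairs back.

public static void main(String[] args) {
        MyHashTable table = new MyHashTable(20, 0.75, 2);   // capacity, load factor, growth factor
        int[] data = {5, 3, 5, 7, 3, 5};
        for (int v : data)
            table.add(v);
        for (MyPair p : table.getPairs())
            System.out.println(p.first + " occurs " + p.second + " times");
    }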

The MyPair class:

public class MyPair implements Comparable<MyPair> {
    public int first;
    public int second;

    public MyPair() {
    }

    public MyPair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public int compareTo(MyPair o) {
        return Integer.compare(first, o.first);   // avoids overflow of a subtraction-based comparison
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        MyPair myPair = (MyPair) o;
        return first == myPair.first;             // pairs are compared by key only
    }

    @Override
    public int hashCode() {
        return first;
    }
}
