
Do it yourself: SQL JOIN in Java (tutorial)
I often interview developers, and I like to ask a blunt, sledgehammer-simple question: how does JOIN actually work inside an SQL database? In response I usually hear vague mumbling about magic trees and indexes that somehow make things faster. I used to think that every professional programmer should understand the tools they work with; life has since taught me otherwise. But I still don't understand how you can sell your skills on the market for years without ever knowing what is under the hood of your database.
Let's run a short crash course: together we will look at how these joins actually work, and we will even implement a couple of the algorithms ourselves.


Problem statement
When candidates claim long and serious experience with SQL, I give them this kind of task at interviews.
Given the SQL query:
SELECT
    left.K,
    left.V1,
    right.V2
FROM left
JOIN right
    ON left.K = right.K;
You need to do the same in Java, i.e. implement a method along these lines:
<K, V1, V2> List<Triple<K, V1, V2>> join(List<Pair<K, V1>> left, List<Pair<K, V2>> right);
I don't ask candidates to code the implementation on the spot, but I do expect at least a verbal explanation of the algorithm.
As a follow-up, I ask what would need to change in the signature and in the implementation to simulate working with indexes.
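One housekeeping note: the snippets in this article rely on a couple of trivial helper types that are not shown in the text. A minimal sketch of what they might look like (the class and field names here are my assumption, chosen to match how the snippets use them):

class Pair<K, V> {
    final K k;
    final V v;
    Pair(K k, V v) { this.k = k; this.v = v; }
}

class Triple<K, V1, V2> {
    final K k;
    final V1 v1;
    final V2 v2;
    Triple(K k, V1 v1, V2 v2) { this.k = k; this.v1 = v1; this.v2 = v2; }
}

// Direction of the join, used by the outer-join example near the end.
enum JoinType { INNER, LEFT, RIGHT, FULL }

As for the index question, one common answer is to accept the inputs already sorted by key, which is exactly what the merge join section below exploits.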
Let's first figure out why we should even think about how joins work.
- Knowing the theory is useful in its own right.
- If you can tell the join types apart, the query execution plan that EXPLAIN prints stops looking like a pile of obscure English words. You can spot potentially slow steps in the plan and optimize the query by rewriting it or adding hints.
- In newfangled analytics tools on top of Hadoop, the query planner is either fairly naive (see Cloudera Impala) or missing altogether (see Twitter Scalding, Spark RDD). In the latter case you have to assemble the query by hand out of primitive operations.
- Finally, there is always the risk that one day you will be interviewed by me or some other bore. But really, this article is not about interviews; it is about the JOIN operation.
Nested loops join
The most basic algorithm for joining two lists is the kind of thing taught in school. The idea is very simple: for each element of the first list, walk through all the elements of the second list; whenever the keys of two elements turn out to be equal, write the match into the resulting table. Two nested loops are enough to implement this, which is where the name comes from.
public static <K, V1, V2> List<Triple<K, V1, V2>> nestedLoopsJoin(List<Pair<K, V1>> left, List<Pair<K, V2>> right) {
    List<Triple<K, V1, V2>> result = new ArrayList<>();
    for (Pair<K, V1> leftPair : left) {
        for (Pair<K, V2> rightPair : right) {
            if (Objects.equals(leftPair.k, rightPair.k)) {
                result.add(new Triple<>(leftPair.k, leftPair.v, rightPair.v));
            }
        }
    }
    return result;
}
The main advantage of this method is its complete indifference to the input data. The algorithm works on any two tables, needs no indexes and no shuffling of data in memory, and is trivial to implement. In practice this means it is enough to crawl over the disk with two cursors and periodically push matches out to the socket.
The fatal drawback, however, is the high time complexity of O(N * M) (quadratic behavior, if you will). For example, joining a pair of fairly small tables of 100k and 500k records takes as many as 100,000 * 500,000 = 50,000,000,000 (50 billion) comparison operations. Queries with such a join churn away for ages; they are a frequent cause of the merciless slowness of sloppy home-grown CMSes.
Modern RDBMSs resort to a nested loops join only in the most hopeless cases, when no other optimization can be applied.
UPD: zhekappp and potapuff point out that nested loops are effective for small row counts, where setting up any optimization would cost more than simply running the nested loop. There is a class of systems for which this is relevant.
Hash join
If one of the tables is small enough to fit entirely in memory, you can build a hash table from it and then look up the keys you need very quickly. Let's go into a little more detail.
Check the sizes of both lists. Take the smaller one, read it in full, and load it into memory by building a HashMap from it. Then go back to the larger list and walk it with a cursor from the beginning. For each key, check whether the same key is present in the hash table; if it is, write the match into the resulting table.
The time complexity of this algorithm drops to linear, O(N + M), but it requires additional memory.
public static <K, V1, V2> List<Triple<K, V1, V2>> hashJoin(List<Pair<K, V1>> left, List<Pair<K, V2>> right) {
    Map<K, V2> hash = new HashMap<>(right.size());
    for (Pair<K, V2> rightPair : right) {
        hash.put(rightPair.k, rightPair.v);
    }
    List<Triple<K, V1, V2>> result = new ArrayList<>();
    for (Pair<K, V1> leftPair : left) {
        if (hash.containsKey(leftPair.k)) {
            result.add(new Triple<>(leftPair.k, leftPair.v, hash.get(leftPair.k)));
        }
    }
    return result;
}
One important detail: back in the age of the dinosaurs, the convention was to load the right table into memory and iterate over the left one. Decent RDBMSs nowadays have cardinality statistics and choose the join order themselves, but if the statistics are unavailable for some reason, it is the right table that gets loaded into memory. This is worth remembering when working with young, clumsy tools like Cloudera Impala.
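The hashJoin above always builds the hash table from the right list, very much in the dinosaur-era spirit. As a rough sketch of the "check the sizes first" idea described earlier (not how any particular RDBMS does it), one could simply swap the arguments so that the smaller list becomes the build side and then restore the column order; this relies on the Triple helper fields sketched above:

// Sketch: build the hash table on the smaller list, then put the columns back in order.
public static <K, V1, V2> List<Triple<K, V1, V2>> hashJoinSmallerSide(
        List<Pair<K, V1>> left, List<Pair<K, V2>> right) {
    if (right.size() <= left.size()) {
        return hashJoin(left, right);  // the right list is the smaller one: build on it as before
    }
    // Otherwise build the hash table on the left list and swap the value columns back afterwards.
    List<Triple<K, V2, V1>> swapped = hashJoin(right, left);
    List<Triple<K, V1, V2>> result = new ArrayList<>(swapped.size());
    for (Triple<K, V2, V1> t : swapped) {
        result.add(new Triple<>(t.k, t.v2, t.v1));
    }
    return result;
}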
Merge join
Now imagine that the data in both lists is sorted in advance, say in ascending order. This happens when the tables have indexes on the join key, or when the data has already been sorted in earlier stages of the query. As you probably remember, two sorted lists can be merged into one sorted list in linear time; that is what merge sort is built on. Here the task is similar, except that instead of gluing the lists together we will be looking for the elements they have in common.
So, place a cursor at the beginning of each list. If the keys under the cursors are equal, write the match into the resulting table. If not, see which cursor is sitting on the smaller key and advance that cursor by one, letting it "catch up" with the other.
public static <K extends Comparable<K>, V1, V2> List<Triple<K, V1, V2>> mergeJoin(
        List<Pair<K, V1>> left,
        List<Pair<K, V2>> right
) {
    List<Triple<K, V1, V2>> result = new ArrayList<>();
    if (left.isEmpty() || right.isEmpty()) {
        return result;  // nothing to join, and avoids calling next() on an empty iterator
    }
    Iterator<Pair<K, V1>> leftIter = left.listIterator();
    Iterator<Pair<K, V2>> rightIter = right.listIterator();
    Pair<K, V1> leftPair = leftIter.next();
    Pair<K, V2> rightPair = rightIter.next();
    while (true) {
        int compare = leftPair.k.compareTo(rightPair.k);
        if (compare < 0) {
            if (leftIter.hasNext()) {
                leftPair = leftIter.next();
            } else {
                break;
            }
        } else if (compare > 0) {
            if (rightIter.hasNext()) {
                rightPair = rightIter.next();
            } else {
                break;
            }
        } else {
            result.add(new Triple<>(leftPair.k, leftPair.v, rightPair.v));
            if (leftIter.hasNext() && rightIter.hasNext()) {
                leftPair = leftIter.next();
                rightPair = rightIter.next();
            } else {
                break;
            }
        }
    }
    return result;
}
If the data is already sorted, the time complexity of the algorithm is linear, O(M + N), and no additional memory is required. If the data is not sorted, it has to be sorted first, which raises the time complexity to O(M log M + N log N) and adds memory requirements on top.
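For illustration, here is a tiny usage sketch of that second case (my own wrapper, assuming the Pair/Triple helpers above): sort copies of the inputs by key, then hand them to mergeJoin.

// Sketch: sort-then-merge join for unsorted inputs, O(M log M + N log N) overall.
public static <K extends Comparable<K>, V1, V2> List<Triple<K, V1, V2>> sortMergeJoin(
        List<Pair<K, V1>> left, List<Pair<K, V2>> right) {
    List<Pair<K, V1>> sortedLeft = new ArrayList<>(left);
    List<Pair<K, V2>> sortedRight = new ArrayList<>(right);
    sortedLeft.sort(Comparator.comparing((Pair<K, V1> p) -> p.k));
    sortedRight.sort(Comparator.comparing((Pair<K, V2> p) -> p.k));
    return mergeJoin(sortedLeft, sortedRight);
}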
Outer joins
You may have noticed that the code above only imitates INNER JOIN and assumes that all keys in both lists are unique, i.e. occur at most once. I did this deliberately, for two reasons. First, it keeps things clear: the code contains only the join logic itself and nothing else. Second, I really wanted to get some sleep. Nevertheless, let's at least discuss what would have to change in the code to support the other join types and non-unique keys.
The first problem is non-unique, i.e. duplicate, keys. For duplicate keys you need to emit the Cartesian product of all the matching values.
In Nested Loops Join this already works as-is, since every left row is compared against every right row anyway.
In Hash Join you have to replace the HashMap with a multimap, e.g. a MultiHashMap or a hand-rolled map from key to list of values (a sketch follows below).
For Merge Join the situation is much sadder: we have to keep track of how many elements with the same key we have already seen, so that the matching run can be replayed for the other side.
Working with non-unique keys raises the asymptotic complexity to O(N * m + M * n), where n and m are the average number of records per key in the respective tables. In the degenerate case where n = N and m = M, the operation turns into a CROSS JOIN.
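To make the hash join point concrete, here is a rough sketch that tolerates duplicate keys by using a plain Map from key to list of values as a poor man's multimap (again assuming the Pair/Triple helpers above):

// Sketch: hash join with duplicate keys, using Map<K, List<V2>> instead of a library multimap.
public static <K, V1, V2> List<Triple<K, V1, V2>> hashJoinDuplicates(
        List<Pair<K, V1>> left, List<Pair<K, V2>> right) {
    Map<K, List<V2>> hash = new HashMap<>();
    for (Pair<K, V2> rightPair : right) {
        hash.computeIfAbsent(rightPair.k, k -> new ArrayList<>()).add(rightPair.v);
    }
    List<Triple<K, V1, V2>> result = new ArrayList<>();
    for (Pair<K, V1> leftPair : left) {
        List<V2> matches = hash.get(leftPair.k);
        if (matches != null) {
            // Cartesian product of the left row with every matching right value
            for (V2 rightValue : matches) {
                result.add(new Triple<>(leftPair.k, leftPair.v, rightValue));
            }
        }
    }
    return result;
}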
The second problem is keeping track of keys for which no match was found.
In Merge Join, an unmatched key is immediately visible for every direction of the JOIN.
In Hash Join, missing matches are immediately visible for the left side. To catch unmatched keys on the right side, you have to keep a separate "found a match!" flag for every entry of the hash table; once the main join is done, you walk the whole hash table and add the entries that never got the flag to the result.
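As an illustration (not production code), a sketch of that flag-based approach for an outer hash join might look like this, assuming the same helper types and unique keys as before:

// Sketch: hash join with LEFT / RIGHT / FULL outer semantics via a "matched" flag per hash entry.
public static <K, V1, V2> List<Triple<K, V1, V2>> outerHashJoin(
        List<Pair<K, V1>> left, List<Pair<K, V2>> right, JoinType joinType) {
    // A right-side value plus a flag saying whether it has found a left-side match
    class RightEntry {
        final V2 v;
        boolean matched;
        RightEntry(V2 v) { this.v = v; }
    }
    Map<K, RightEntry> hash = new HashMap<>(right.size());
    for (Pair<K, V2> rightPair : right) {
        hash.put(rightPair.k, new RightEntry(rightPair.v));  // keys assumed unique, as before
    }
    List<Triple<K, V1, V2>> result = new ArrayList<>();
    for (Pair<K, V1> leftPair : left) {
        RightEntry entry = hash.get(leftPair.k);
        if (entry != null) {
            entry.matched = true;
            result.add(new Triple<>(leftPair.k, leftPair.v, entry.v));
        } else if (joinType == JoinType.LEFT || joinType == JoinType.FULL) {
            // An unmatched left row is visible immediately
            result.add(new Triple<>(leftPair.k, leftPair.v, null));
        }
    }
    if (joinType == JoinType.RIGHT || joinType == JoinType.FULL) {
        // Second pass over the hash table to emit unmatched right rows
        for (Map.Entry<K, RightEntry> e : hash.entrySet()) {
            if (!e.getValue().matched) {
                result.add(new Triple<>(e.getKey(), null, e.getValue().v));
            }
        }
    }
    return result;
}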
For Nested Loops Join the situation is similar, and it is all so simple that I even managed to write the code:
public static <K, V1, V2> List<Triple<K, V1, V2>> nestedLoopsJoin(
        List<Pair<K, V1>> left,
        List<Pair<K, V2>> right,
        JoinType joinType
) {
    // Bit set marking the entries of the right list that have found a match in the left one
    BitSet rightMatches = new BitSet(right.size());
    List<Triple<K, V1, V2>> result = new ArrayList<>();
    for (Pair<K, V1> leftPair : left) {
        // Flag marking whether the current left entry has found a match in the right list
        boolean match = false;
        for (ListIterator<Pair<K, V2>> iterator = right.listIterator(); iterator.hasNext(); ) {
            Pair<K, V2> rightPair = iterator.next();
            if (Objects.equals(leftPair.k, rightPair.k)) {
                result.add(new Triple<>(leftPair.k, leftPair.v, rightPair.v));
                // Record the match on both sides
                match = true;
                rightMatches.set(iterator.previousIndex(), true);
            }
        }
        // Emit left rows that found no match
        if ((joinType == JoinType.LEFT || joinType == JoinType.FULL) && !match) {
            result.add(new Triple<>(leftPair.k, leftPair.v, null));
        }
    }
    // Emit right rows that found no match
    if (joinType == JoinType.RIGHT || joinType == JoinType.FULL) {
        for (int i = 0; i < right.size(); ++i) {
            if (!rightMatches.get(i)) {
                Pair<K, V2> rightPair = right.get(i);
                result.add(new Triple<>(rightPair.k, null, rightPair.v));
            }
        }
    }
    return result;
}
Moralizing afterword
If you have read this far, the article must have seemed interesting to you, so forgive me a small portion of moralizing.
I sincerely believe that knowing SQL syntax alone, without understanding the RDBMS behind it, is nowhere near enough to consider yourself a professional software developer. A professional should know not only their own code but also, at least roughly, how their neighbors on the stack work, i.e. the third-party systems they rely on: databases, frameworks, network protocols, file systems. Without that, a developer degenerates into a coder or a computer operator and becomes useless on genuinely complex tasks.
UPD: Despite this afterword, the article really is about JOINs.
Additional materials
- An excellent article on how an RDBMS works under the hood
- An excellent book on algorithms. Worth leafing through before an interview.
Disclaimer: I had actually promised another article, about Scalding, but the previous one did not attract much interest, so I decided to change the subject.