Visualize and deal with Hash Match Join
- Transfer
This post is the third part of the series on join operators (be sure to read part 1 - nested loops joins , and part 2 - merge joins ). The translation of the article was prepared specifically for students of the course "MS SQL Server Developer" .
Hash Match Joins are the reliable workhorses of physical connection operators.
While Nested Loops Join will fail if there is too much data to fit into memory, and Merge Join will require the input to be sorted, Hash Match will connect any data you submit (provided that an equality predicate is executed for the connection, and so far there is enough free space in your tempdb).
The hash match algorithm consists of two stages, which work as follows:
During the first build phase, SQL Server creates a hash table in memory from one of the input tables (usually the smallest of the two). The hashes are calculated based on the input keys and then stored along with the line in the hash table in the corresponding block. In most cases, there is only one row of data in each block, except when:
- There are lines with duplicate keys.
- The hash function creates a collision, and completely different keys receive the same hash (this is rare, but possible).
After creating the hash table, the “Probe” (verification) stage begins. In the second step, SQL Server calculates the key hash for each row in the second input table and checks if it exists in the hash table created in the first step. If there is a match for this hash, then it is checked whether the keys of the row (s) in the hash table and the rows from the second table really match (this check must be performed due to possible collisions).
A common version of the hash match algorithm occurs when, at the construction stage, it is not possible to create a hash table that can be completely stored in memory:
This happens when there is more data than can be stored in memory, or when SQL Server provides insufficient memory for a hash match connection.
When SQL Server does not have enough memory to store the hash table during the build phase, it continues to work, storing some blocks in memory and placing other blocks in tempdb.
In the validation phase, SQL Server concatenates the rows of data from the second table into blocks from the build phase in memory. If the block to which this line potentially corresponds is currently out of memory, SQL Server writes this line to tempdb for later comparison.
When matches for one block are completed, SQL Server clears this data from memory and loads the following blocks into memory. Then it compares the rows of the second table (currently located in tempdb) with the new blocks in memory.
As with every physical join statement in this series, details on the hash match statement can be found in Hugo Kornelis help on hash match .
What does Hash Match Join show?
Knowing the internal features of how hash match join works , allows us to determine what the optimizer thinks about our data and upstream connection operators, helping us focus on performance tuning.
Here are some scenarios to consider the next time you see that hash match join is used in your execution plan:
- While hash match join can combine huge data sets, constructing a hash table from the first input table is a blocking operation that prevents the execution of subsequent statements. In this regard, I always check if there is an easy way to convert hash match to nested loops or merge join. Sometimes this is not possible (too many rows for nested loops or unsorted data for merge join), but it is always worth checking whether a simple index change or improved estimates will result from updating statistics to the fact that SQL Server selects a non-blocking hash match join statement
- Hash match joins are great for large connections, as they can be transferred to tempdb, this allows them to make connections to large datasets, which can lead to a failed connection in memory using nested loops or merge join statements.
- If you see a hash match join statement , it means that SQL Server thinks the input is too large. If we know that our input data should not be so large, then it is worth checking if there are problems with statistics or estimation, due to which SQL Server incorrectly selects hash match join .
- When executed in memory, hash match join is pretty efficient. Problems arise when the build phase goes to tempdb.
- If I notice a small yellow triangle indicating that the connection goes to tempdb, I see why this happened: if there is more data than memory is available, there is little that can be done, but if the allocated memory seems unreasonably small, this may mean that we probably have one more problem with statistics that leads to too low estimates of the SQL Server optimizer.
Thanks for reading the article. You may also like my Twitter .
We covered this topic in a previous open lesson . Waiting for your comments!