
Non-cryptographic hash functions and DoS attack on them
Non-cryptographic hash functions are used where speed is important and the ability to attack the characteristics of a function is not so important. Recently, an attack on the algorithmic complexity of hash tables has been actively discussed by creating multiple collisions of the hash function, which can lead to DoS. We will consider modern non-cryptographic hash functions, conditions for their use, possible methods of protection against attacks on hash tables, and why it turned out that this is not so easy to fix.

Non-cryptographic hash functions
If everyone has heard cryptographic hash functions, then little is known about non-cryptographic (general purpose hash functions). Non-cryptographic functions are used where data is not affected by third parties (an attacker). For example, such functions can be used to build hash tables.
Criteria that are important for non-cryptographic hash functions:
- Output values must have uniform distribution . This criterion is important for constructing hash tables so that the search speed is optimal.
- Each input bit should in 50% of cases affect each output bit (avalanche - avalanche criterion ). This criterion is important for using a hash as a unique identifier for a document. Also, sometimes a hash can have a uniform distribution, but fail the avalanche criterion - in this case there will be keys for which the hash will not give an even distribution.
- There should be no correlation between pairs of bits at the output . We will not consider this, because usually there is no problem with this criterion.
- Speed . All the hashes listed are an order of magnitude faster than cryptographic ones.

Fnv-1a
2003 year. Speed: 4.5 cycles per byte.
A very simple function [1] (RFC [5]).
1. Uniform distribution: poor, up to 14 bits can be used for the hash table (up to 2 14 hash table slots). Slightly better in the variation of FNV-1a, which is recommended by default.
2. Avalanche criterion: bad. Also bad in FNV-1a.
Both criteria are significantly improved in the modified FNV-1a [2], where mixing at the end is added to FNV-1a.
Bob Jenkins' Hash (lookup3)
2006, 32 and 64 bit lengths, 32-bit instructions, speed: 2.62 cycles per byte (for messages 16 bytes long).
1. Uniform distribution: excellent for all bits.
2. Avalanche criterion: good. There is a slight problem with the upper bits.
The first version of the function was called One-at-a-Time.
CRC32
1975, length: 32, 64 and 128 bits, speed: 1.2–0.13 cycles per byte (for messages shorter than 64 bits, depending on the CPU).
1. Uniform distribution: poor distribution of the lower bits. Use the high bits for the hash table, then you can use all the tested bits - up to 16 high bits (up to 2 16 hash table slots).
2. Avalanche criterion: very bad. There is no avalanche effect. The purpose of the hash is affected: its output bits are designed to detect errors in the data channel. Also, CRC has an important property: you can split the message into parts, calculate the hash for each part, then combine the message, and calculate the hash of the entire message from the hashes of its parts.
MurmurHash 2
2008, 32 and 64 bit lengths, 32 and 64-bit instructions.
There is a problem with the uniform distribution of some keys [3].
MurmurHash 3
2011, 32 and 128 bit lengths, 32 and 64 bit instructions, speed: 1.87 cycles per byte (for messages 16 bytes long).
1. Uniform distribution: good (in one case out of three “almost suspicious”).
2. Avalanche criterion: excellent.

Cityhash
2011, 32, 64, 128 and 256 bit lengths, 64-bit instructions. A rather complicated function.
For the fastest option, you need CRC32 processor instructions (SSE 4.2 - Intel Nehalem).
1. Uniform distribution: good (in one case out of three “almost suspicious”).
2. Avalanche criterion: good. There is a slight problem with the lower bits.
Spookyhash
2011, 32, 64 and 128 bit lengths, speed: 0.33 cycles per byte. Posted by Bob Jenkins.
1. Uniform distribution: average (in one case of the three “suspicious”).
2. Avalanche criterion: excellent.
Statistical analysis of hash functions can be done using tools [17]; analysis results [2], [4], [18] were used.
CityHash is more complex and has more bandwidth (speed on long messages) than MurmurHash. MurmurHash is simpler, has a low latency for short messages (<20 bytes).
CityHash uses CRC instructions for new processors, providing up to 0.17 cycles per byte, which is faster than all other hashes; for comparison, SHA-1 - 4.6 cycles per byte (Ivy Bridge). Without the support of these instructions, CityHash is twice as slow as SpookyHash (according to the author of the latter).
For a 32-bit platform, MurmurHash 3 can be faster than SpookyHash and CityHash.
Perfect hash function
If the message array is known in advance, then you can use the perfect hash function: without collisions and with a search in one comparison operation. Example: gperf.
If the hash function has a normal distribution, then the probability of collision will be as follows:
- for 32 bits: with a probability of 1 in 10 000 there will be a collision among 927 hashes, with a probability of 1 in 100 there will be a collision among 9 292 hashes,
- for 64 bits: with a probability of 1 in 10 000 there will be a collision among 60.7 million hashes, with a probability of 1 in 100 there will be a collision among 609 million hashes.

How real are the collisions? For 32-bit hashes, they occur regularly. For example, CRC32 matches “codding” and “gnu”, “exhibiters” and “schlager”, “curl” and “natural”, etc. This is not the fault of the statistical features of the hash function, but the consequence of the limited size of the hash - only 4 bytes.
All listed functions, except for CRC32 and FNV-1, have good statistics. For a 32-bit system, MurmurHash will probably be the fastest. If the system cannot do unaligned memory reads, such as ARM or SPARC, then MurmurHash may also work the fastest. The speed of CityHash is highly dependent on the compiler, since it uses a lot of complex code, it can be either more or less than the rest.
Recommendations
As a general-purpose hash function, you can use MurmurHash 3, CityHash, SpookyHash V2, gperf.

DoS attack on the hash table
You can imagine a DoS attack when an attacker feeds data into a hash table so that they all fall into one slot, and the worst speed is to search for one element - O (n), insert each new collision - O (n²), the memory size grows ( hash-flooding). To carry out an attack, you need to be able to put arbitrary data in a hash table, and find collisions of the hash function.

Collisions can be created in several ways:
- if the hash function is known in advance and the hash length is small, then use the brute force method: calculate and search in memory meet-in-the-middle.
- if the hash function is not resistant to collisions, then collisions can be calculated mathematically. Non-cryptographic functions are quite simple, and to search for collisions it is often enough to execute the algorithm in the reverse order ([6], oCERT [7]).
We list the properties of cryptographic hash functions (not applicable to general purpose hash functions):
- resistance to restoration of the pre-image (given m , it is impossible to find X so that H ( X ) = m ),
- resistance to restoration of the second prototype (given m 1 , it is impossible to find m 2 so that H ( m 1 ) = H ( m 2 )),
- resistance to collision search (it is impossible to find m 1 and m 2 so that H ( m 1 ) = H ( m 2 )).
Sometimes the last two criteria are called resistance to collisions of the first and second kind, respectively. We will call a collision only the case when neither a message nor a hash is initially set.
Usually the “lightest” attack is a search for collisions. Due to the “birthdays” paradox, the complexity of this attack is at least two times less than the length of the hash. It is this attack that exists for SHA-1: with the theoretical complexity of the collision O (2 80 ), the attack reduces the complexity to O (2 61 ). There are no attacks to restore the first or second prototype for SHA-1.
To increase resistance to collisions, you can:
- Use a cryptographic hash function to calculate a hash of sufficient length. It will be resistant to collisions, but if collisions are once calculated, they can be used to attack any system.
- use HMAC (MAC based cryptographic hash function).
- use a MAC based on PRF or a universal one-way hash function.
A MAC based on a cryptographic function has the following properties:
- you can’t find the key, having the ability to get a hash for any message,
- You cannot find the correct hash for a message without knowing the key.
- greater resistance to collisions, since the family of hash functions is not known to the attacker in advance.
If we decide to use a cryptographic hash function (for example, SHA-3) or a MAC based on it, then we will encounter the following disadvantages:
- too slow for hash tables: the algorithm is designed for long messages, and works slowly for short messages,
- although the full length of the hash is resistant to collisions, but if we take only the first n bits, it will simplify the search for collisions,
- in the case of HMAC, cryptographic hash functions are used that perform unnecessary work to ensure collision resistance, although this is not necessary because the secret key is used.
In addition to the message, the universal hash function also receives a key, using which it selects a hash function from a finite family of functions. With the key known to the attacker, the function must be resistant to restoring the second prototype, but there is no requirement for resistance to collisions. If the key is unknown to the attacker, then the function will be resistant to collisions. Given our speed requirements, it is advisable to do without cryptographic functions.
To complicate the search for collisions, sometimes they try to build a MAC based on general-purpose hash functions using seed as the key. But the resulting "MAC" will not have the required properties, since it is not built on the basis of a cryptographic function. The following attacks are possible:
- create collisions without knowledge of the seed, using mathematical analysis of the hash function,
- find out the seed, having access to the result of the hash function (for example, the program returns data sorted by default hash, this provides a data leak about the hash value),
- Get data through timing attack.
Therefore, although this approach can complicate the life of a random attacker, it often does not withstand the simplest cryptanalysis (oCERT [8]). It has been shown that most general-purpose hash functions are not resistant to cryptographic attacks:
- xxHash is not resistant to restoring a second prototype even without knowing the key, knowing the hash [12].
- MurmurHash 2, MurmurHash 3, CityHash 1.0.3 are not resistant to collisions without knowing the key and hash: a way was found to quickly find collisions that are independent of the choice of seed (“universal” multicollisions) [13].
- Python hash is not resistant to key recovery, knowing the result of a hash function, after which it is possible to carry out an attack against a 64-bit state using meet-in-the-middle [13].
- Python hash is not collision resistant without knowing the key and hash [14].
Until 2008-2010, most languages and frameworks used fairly simple hash functions:
- PHP 4, ASP.NET and JavaScript: DJBX33X,
- PHP 5, Ruby 1.8 and Java: DJBX33A,
- Rust: DJB (2012)
- Python: modified FNV,
- JRuby: MurmurHash2,
- Haskell: FNV-1,
- Redis: djbhash, later MurmurHash 2.
Exception: Perl 4, used One-at-a-Time with random seed since 2003 (this attack was discussed at the conference this year), later (5.17.6) used MurmurHash 3.
To reduce the number of collisions and complicate the DoS attack, many languages over the past year have switched to modern hash functions with random seeds. And when it turned out that this did not solve the problem with the attack, then many changed the hash a second time.
Collision Resistant MAC Example :
- VMAC, UMAC, Poly1305-AES - use AES-128, are optimal for long messages.
- SipHash - is a PRF, does not use cryptographic algorithms, is optimal for short messages.
- “Strongly universal string hashing is fast” [10].
Siphash
2012, the length of 64 bits, the key is 128 bits.
Speed: 15.50 cycles per byte for 8 bytes of a message, 3.66 cycles per byte for 64 bytes of a message, up to 1.96 cycles per byte for long messages. For comparison, the fastest finalist SHA-3 BLAKE-512: 134 cycles per byte for 8 bytes, 20 cycles per byte for 64 bytes.
You can use any number of compression and finalization rounds. For example, SipHash-2-4 - 2 and 4 rounds respectively.
It was designed as a collision-free hash function based on cryptographic pseudo-random function (PRF), fast for short messages. Since it is intended to be used with a random key, there is no requirement for resistance to collisions. It differs from non-cryptographic hash functions in that it works correctly and safely with the key, providing both MAC properties and properties of a universal hash function.

The language situation currently looks like this:
- Rust: SipHash 2-4,
- Rails: SipHash 2-4,
- Perl (5.8.1): SipHash 2-4 (64 bits) and One-at-a-time (32 bits),
- Ruby (1.9.3-p327), JRuby: SipHash 2-4,
- Python (2.7.3, 3.2.3): its own hash function, when -R is turned on, a random seed is added [11],
- Haskell (1.2): SipHash (for strings),
- Go: aeshash (supported by AES-NI) or a simple native function (variation of FNV-1),
- PHP (5): DJBX33A, there is a limit on the number of GET / POST elements (max_input_vars),
- Java SE (7u6 and 8): when jdk.map.althashing.threshold is enabled, MurmurHash with random seed (alternative string hashing) is used,
- Scala: Byteswap32 and MurmurHash 3 or hash from Java,
- .NET (4.5): when enabling UseRandomizedStringHashAlgorithm, Marvin32's own development is used [9].
Open Tasks:
- Java: no reaction to MurmurHash cryptanalysis [16],
- Python: recognized that the “-R” switch is useless, and are pondering what is the best way to fix a hash table attack [14]
- Go: implemented aeshash, but there is no cryptanalysis yet [15].
Other protection methods
It is believed that the fight against this attack is not the task that the hash function should perform. You can instead:
- Use data structures with less complexity in the worst case. For example, a balanced tree — an AVL tree or a red-black tree — guarantees O (log n ) in any case . It is important to consider new attacks, for example, cascading rebalance of a tree with specially generated data, but even in this case we will get a better result - O ( n log n ),
- limit the number of processed elements (for example, POST / GET),
- raise an exception if the number of collisions is greater than the limit (or simply discard data if possible),
- raise an exception if the CPU is busy more than normal for timeout operations.
Recommendations
For data structures that use untrusted data, apply algorithms with complexity O (log n ), for example, a balanced tree. If not possible, then use a hash function randomly selected from a family of functions (MAC), for example, VMAC or SipHash.
Sources
[1] www.isthe.com/chongo/tech/comp/fnv/index.html
[2] Pluto Scarab - Hash Functions
[3] code.google.com/p/smhasher/wiki/MurmurHash2Flaw
[4] “Choosing a Good Hash Function, Part 3 - AK Tech Blog »- blog.aggregateknowledge.com/2012/02/02/choosing-a-good-hash-function-part-3
[5] tools.ietf.org/html/draft- eastlake-fnv-05
[6] www.ocert.org/advisories/ocert-2011-003.html
[7] www.nruns.com/_downloads/advisory28122011.pdf
[8] www.ocert.org/advisories/ocert- 2012-001.html
[9] github.com/floodyberry/Marvin32/blob/master/Marvin32.c
[10] “Strongly universal string hashing is fast” - arxiv.org/abs/1202.4961
[11] “Issue 13703: Hash collision security issue” - bugs.python.org/issue13703
[12] crypto.stackexchange.com/questions/6408/from-hash-to-cryptographic-hash
[13] 131002.net/siphash / # at
[14] “Issue 14621: Hash function is not randomized properly” - bugs.python.org/issue14621
[15] “Issue 4604 - go - runtime: switch to a fast, collision-resistant hash function” - code. google.com/p/go/issues/detail?id=4604
[16] "Bug 880705 - CVE-2012-5373 java: Murmur hash function collisions (oCERT-2012-001)" - bugzilla.redhat.com/show_bug. cgi? id = 880705
[17] code.google.com/p/smhasher/wiki/SMHasher
[18] floodyberry.com/noncryptohashzoo
[2] Pluto Scarab - Hash Functions
[3] code.google.com/p/smhasher/wiki/MurmurHash2Flaw
[4] “Choosing a Good Hash Function, Part 3 - AK Tech Blog »- blog.aggregateknowledge.com/2012/02/02/choosing-a-good-hash-function-part-3
[5] tools.ietf.org/html/draft- eastlake-fnv-05
[6] www.ocert.org/advisories/ocert-2011-003.html
[7] www.nruns.com/_downloads/advisory28122011.pdf
[8] www.ocert.org/advisories/ocert- 2012-001.html
[9] github.com/floodyberry/Marvin32/blob/master/Marvin32.c
[10] “Strongly universal string hashing is fast” - arxiv.org/abs/1202.4961
[11] “Issue 13703: Hash collision security issue” - bugs.python.org/issue13703
[12] crypto.stackexchange.com/questions/6408/from-hash-to-cryptographic-hash
[13] 131002.net/siphash / # at
[14] “Issue 14621: Hash function is not randomized properly” - bugs.python.org/issue14621
[15] “Issue 4604 - go - runtime: switch to a fast, collision-resistant hash function” - code. google.com/p/go/issues/detail?id=4604
[16] "Bug 880705 - CVE-2012-5373 java: Murmur hash function collisions (oCERT-2012-001)" - bugzilla.redhat.com/show_bug. cgi? id = 880705
[17] code.google.com/p/smhasher/wiki/SMHasher
[18] floodyberry.com/noncryptohashzoo