10.7. Improved Collision Resolution — CS3 Data Structures & Algorithms

10.7.4. Double Hashing¶

Both pseudo-random probing and quadratic probing eliminate primary clustering, which is the name given to the the situation when keys share substantial segments of a probe sequence. If two keys hash to the same home position, however, then they will always follow the same probe sequence for every collision resolution method that we have seen so far. The probe sequences generated by pseudo-random and quadratic probing (for example) are entirely a function of the home position, not the original key value. This is because function p ignores its input parameter \(K\) for these collision resolution methods. If the hash function generates a cluster at a particular home position, then the cluster remains under pseudo-random and quadratic probing. This problem is called secondary clustering.

To avoid secondary clustering, we need to have the probe sequence make use of the original key value in its decision-making process. A simple technique for doing this is to return to linear probing by a constant step size for the probe function, but to have that constant be determined by a second hash function, \(\textbf{h}_2\). Thus, the probe sequence would be of the form \(\textbf{p}(K, i) = i * \textbf{h}_2(K)\). This method is called double hashing.

There are important restrictions on \(h_2\). Most importantly, the value returned by \(h_2\) must never be zero (or \(M\)) because that will immediately lead to an infinite loop as the probe sequence makes no progress. However, a good implementation of double hashing should also ensure that all of the probe sequence constants are relatively prime to the table size \(M\). For example, if the hash table size were 100 and the step size for linear probing (as generated by function \(h_2\)) were 50, then there would be only one slot on the probe sequence. If instead the hash table size is 101 (a prime number), than any step size less than 101 will visit every slot in the table.

This can be achieved easily. One way is to select \(M\) to be a prime number, and have \(\textbf{h}_2\) return a value in the range \(1 <= \textbf{h}_2(k) <= M - 1\). We can do this by using this secondary hash function: \(\textbf{h}_2(k) = 1 + (k \mod (M-1))\). An alternative is to set \(M = 2^m\) for some value \(m\) and have \(\textbf{h}_2\) return an odd value between 1 and \(2^m\). We can get that result with this secondary hash function: \(\textbf{h}_2(k) = (((k/M) \mod (M/2)) * 2) + 1\). [1]

Settings

Saving...

Server Error
Resubmit

Settings

Saving...

Server Error
Resubmit

Now you can try it.

[1]

The secondary hash function \(\textbf{h}_2(k) = (((k/M) \mod (M/2)) * 2) + 1\) might seem rather mysterious, so let's break this down. This is being used in the context of two facts: (1) We want the function to return an odd value that is less than \(M\) the hash table size, and (2) we are using a hash table of size \(M = 2^m\), which means that taking the mod of size \(M\) is using the bottom \(m\) bits of the key value. OK, since \(\textbf{h}_2\) is multiplying something by 2 and adding 1, we guarentee that it is an odd number. Now, \(((X \mod (M/2)) * 2) + 1\) must be in the range 1 and \(M-1\) (if you need to, play around with this on paper to convince yourself that this is true). This is exactly what we want. The last piece of the puzzle is the first part \(k/M\). That is not strictly necessary. But remember that since the table size is \(M = 2^m\), this is the same as shifting the key value right by \(m\) bits. In other words, we are not using the bottom \(m\) bits to decide on the second hash function value, which is especially a good thing if we used the bottom \(m\) bits to decide on the first hash function value! In other words, we really do not want the value of the step sized used by the linear probing to be fixed to the slot in the hash table that we chose. So we are using the next \(m\) bits of the key value instead. Note that this would only be a good idea if we have keys in a large enough key range, that is, we want plenty of use of those second \(m\) bits in the key range. This will be true if the max key value uses at least \(2m\) bits, meaning that the max key value should be at least the square of the hash table size. This is not a problem for typical hashing applications.