.. _Hashing: .. raw:: html .. |--| unicode:: U+2013 .. en dash .. |---| unicode:: U+2014 .. em dash, trimming surrounding whitespace :trim: .. odsalink:: AV/Hashing/hashIntroCON.css .. odsalink:: AV/Hashing/openhashCON.css .. odsalink:: AV/Hashing/buckethashCON.css .. odsalink:: AV/Hashing/linProbeCON.css .. odsalink:: AV/Hashing/collisionCON.css .. This file is part of the OpenDSA eTextbook project. See .. http://algoviz.org/OpenDSA for more details. .. Copyright (c) 2012-2013 by the OpenDSA Project Contributors, and .. distributed under an MIT open source license. .. avmetadata:: :author: Cliff Shaffer ======= Hashing ======= Hashing ------- Hashing (1) ~~~~~~~~~~~ Hashing: The process of mapping a key value to a position in a table. A hash function maps key values to positions. It is denoted by :math:`h`. A hash table is an array that holds the records. It is denoted by **HT**. **HT** has :math:`M` slots, indexed form 0 to :math:`M-1`. Hashing (2) ~~~~~~~~~~~ For any value :math:`K` in the key range and some hash function :math:`h`, :math:`h(K) = i`, :math:`0 <= i < M`, such that key(HT[i]) :math:`= K`. Hashing is appropriate only for sets (no duplicates). Good for both in-memory and disk-based applications. Answers the question "What record, if any, has key value K?" Simple Examples ~~~~~~~~~~~~~~~ .. inlineav:: hashIntroCON ss :long_name: Hashing Intro Slideshow :output: show * More reasonable example: * Store about 1000 records with keys in range 0 to 16,383. * Impractical to keep a hash table with 16,384 slots. * We must devise a hash function to map the key range to a smaller table. Collisions (1) ~~~~~~~~~~~~~~ * Given: hash function **h** with keys :math:`k_1` and :math:`k_2`. :math:`\beta` is a slot in the hash table. * If :math:`\mathbf{h}(k_1) = \beta = \mathbf{h}(k_2)`, then :math:`k_1` and :math:`k_2` have a collision at :math:`\beta` under **h**. * Search for the record with key :math:`K`: #. Compute the table location :math:`\mathbf{h}(K)`. #. Starting with slot :math:`\mathbf{h}(K)`, locate the record containing key :math:`K` using (if necessary) a collision resolution policy. Collisions (2) ~~~~~~~~~~~~~~ * Collisions are inevitable in most applications. * Example: Among 23 people, some pair is likely to share a birthday. .. avembed:: AV/Hashing/Birthday.html pe :module: Hashing Hash Functions (1) ~~~~~~~~~~~~~~~~~~ * A hash function MUST return a value within the hash table range. * To be practical, a hash function SHOULD evenly distribute the records stored among the hash table slots. * Ideally, the hash function should distribute records with equal probability to all hash table slots. In practice, success depends on distribution of actual records stored. Hash Functions (2) ~~~~~~~~~~~~~~~~~~ * If we know nothing about the incoming key distribution, evenly distribute the key range over the hash table slots while avoiding obvious opportunities for clustering. * If we have knowledge of the incoming distribution, use a distribution-dependent hash function. Simple Mod Function ~~~~~~~~~~~~~~~~~~~ :: int h(int x) { return x % 16; } .. inlineav:: hashFuncExCON1 ss :long_name: Hash Function Slideshow 1 :output: show Binning ~~~~~~~ .. inlineav:: hashFuncExCON2 ss :long_name: Hash Function Slideshow 2 :output: show Mod vs. Binning ~~~~~~~~~~~~~~~ .. odsafig:: Images/HashNormal.png :width: 750 :align: center :capalign: center :figwidth: 90% :alt: Binning vs. Mod Function Mid-Square Method ~~~~~~~~~~~~~~~~~ .. odsafig:: Images/MidSquare.png :width: 100 :align: center :capalign: justify :figwidth: 90% :alt: Mid-square method example .. avembed:: AV/Hashing/MidSquare.html pe :module: Hashing Strings Function: Character Adding ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :: int sascii(String x, int M) { char ch[]; ch = x.toCharArray(); int xlength = x.length(); int i, sum; for (sum=0, i=0; i < x.length(); i++) sum += ch[i]; return sum % M; } .. avembed:: AV/Hashing/StringSimple.html pe :module: Hashing String Folding ~~~~~~~~~~~~~~ .. codeinclude:: Hashing/sfold :tag: sfold .. avembed:: AV/Hashing/StringSfold.html pe :module: Hashing Open Hashing ~~~~~~~~~~~~ .. inlineav:: openhashCON dgm Bucket Hashing (1) ~~~~~~~~~~~~~~~~~~ .. inlineav:: buckethashCON1 ss :long_name: Bucket Hashing Slideshow 1 :output: show Bucket Hashing (2) ~~~~~~~~~~~~~~~~~~ .. inlineav:: buckethashCON2 ss :long_name: Bucket Hashing Slideshow 2 :output: show Closed Hashing ~~~~~~~~~~~~~~ * Closed hashing stores all records directly in the hash table. * Each record :math:`i` has a home position :math:`\mathbf{h}(k_i)`. * If another record occupies the home position for :math:`i`, then another slot must be found to store :math:`i`. * The new slot is found by a collision resolution policy. * Search must follow the same policy to find records not in their home slots. Collision Resolution ~~~~~~~~~~~~~~~~~~~~ * During insertion, the goal of collision resolution is to find a free slot in the table. * Probe sequence: The series of slots visited during insert/search by following a collision resolution policy. * Let :math:`\beta_0 = \mathbf{h}(K)`. Let :math:`(\beta_0, \beta_1, ...)` be the series of slots making up the probe sequence. Insertion ~~~~~~~~~ :: // Insert e into hash table HT void hashInsert(const Key& k, const Elem& e) { int home; // Home position for e int pos = home = h(k); // Init probe sequence for (int i=1; EMPTYKEY != (HT[pos]).key(); i++) { pos = (home + p(k, i)) % M; // probe if (k == HT[pos].key()) { println("Duplicates not allowed"); return; } } HT[pos] = e; } Search ~~~~~~ :: // Search for the record with Key K bool hashSearch(const Key& K, Elem& e) const { int home; // Home position for K int pos = home = h(K); // Initial position is the home slot for (int i = 1; (K != (HT[pos]).key()) && (EMPTYKEY != (HT[pos]).key()); i++) pos = (home + p(K, i)) % M; // Next on probe sequence if (K == (HT[pos]).key()) { // Found it e = HT[pos]; return true; } else return false; // K not in hash table } Probe Function ~~~~~~~~~~~~~~ * Look carefully at the probe function p():: pos = (home + p(k, i)) % M; // probe * Each time p() is called, it generates a value to be added to the home position to generate the new slot to be examined. * :math:`p()` is a function both of the element's key value, and of the number of steps taken along the probe sequence. Not all probe functions use both parameters. Linear Probing (1) ~~~~~~~~~~~~~~~~~~ * Use the following probe function:: p(K, i) = i; * Linear probing simply goes to the next slot in the table. * Past bottom, wrap around to the top. * To avoid infinite loop, one slot in the table must always be empty. Linear Probing (2) ~~~~~~~~~~~~~~~~~~ .. inlineav:: linProbeCON1 ss :long_name: Linear Probing Slideshow 1 :output: show Problem with Linear Probing ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. inlineav:: linProbeCON2 ss :long_name: Linear Probing Slideshow 2 :output: show * The primary goal of a collision resolution mechanism: * Give each empty slot of the table an equal probability of receiving the next record. Linear Probing by Steps (1) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ * Instead of going to the next slot, skip by some constant c. * Warning: Pick M and c carefully. .. inlineav:: collisionCON1 ss :long_name: Linear Probing By Steps Slideshow 1 :output: show * This effectively splits the key range, and the hash table, into two halves. This leads to reduced performance. Linear Probing by Steps (2) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ * The probe sequence SHOULD cycle through all slots of the table. * Pick :math:`c` to be relatively prime to :math:`M`. .. inlineav:: collisionCON2 ss :long_name: Linear Probing By Steps Slideshow 2 :output: show Pseudo-Random Probing (1) ~~~~~~~~~~~~~~~~~~~~~~~~~ .. inlineav:: collisionCON3 ss :long_name: Pseudo-Random Probing Slideshow :output: show Pseudo-Random Probing (2) ~~~~~~~~~~~~~~~~~~~~~~~~~ .. inlineav:: collisionCON4 ss :long_name: Avoiding the Train :output: show Quadratic Probing ~~~~~~~~~~~~~~~~~ .. inlineav:: collisionCON5 ss :long_name: Quadratic Probing Slideshow :output: show .. inlineav:: collisionCON6 ss :long_name: Quadratic Probing Problem :output: show Double Hashing (1) ~~~~~~~~~~~~~~~~~~ .. inlineav:: collisionCON7 ss :long_name: Double Hashing Slideshow 2 :output: show Double Hashing (2) ~~~~~~~~~~~~~~~~~~ .. inlineav:: collisionCON8 ss :long_name: Double Hashing Slideshow 3 :output: show Analysis of Closed Hashing ~~~~~~~~~~~~~~~~~~~~~~~~~~ The load factor is :math:`\alpha = N/M` where :math:`N` is the number of records currently in the table. .. odsafig:: Images/hashplot.png :width: 600 :align: center :capalign: justify :figwidth: 90% :alt: Hashing analysis plot Deletion ~~~~~~~~ * Deleting a record must not hinder later searches. * We do not want to make positions in the hash table unusable because of deletion. * Both of these problems can be resolved by placing a special mark in place of the deleted record, called a tombstone. * A tombstone will not stop a search, but that slot can be used for future insertions. Tombstones (1) ~~~~~~~~~~~~~~ .. inlineav:: hashdelCON ss :long_name: Hash Deletion Slideshow :output: show Tombstones (2) ~~~~~~~~~~~~~~ * Unfortunately, tombstones add to the average path length. * Solutions: #. Local reorganizations to try to shorten the average path length. #. Periodically rehash the table (by order of most frequently accessed record). .. odsascript:: AV/Hashing/hashIntroCON.js .. odsascript:: AV/Hashing/hashFuncExCON1.js .. odsascript:: AV/Hashing/hashFuncExCON2.js .. odsascript:: AV/Hashing/openhashCON.js .. odsascript:: AV/Hashing/buckethashCON1.js .. odsascript:: AV/Hashing/buckethashCON2.js .. odsascript:: AV/Hashing/linProbeCON1.js .. odsascript:: AV/Hashing/linProbeCON2.js .. odsascript:: AV/Hashing/collisionCON1.js .. odsascript:: AV/Hashing/collisionCON2.js .. odsascript:: AV/Hashing/collisionCON3.js .. odsascript:: AV/Hashing/collisionCON4.js .. odsascript:: AV/Hashing/collisionCON5.js .. odsascript:: AV/Hashing/collisionCON6.js .. odsascript:: AV/Hashing/collisionCON7.js .. odsascript:: AV/Hashing/collisionCON8.js .. odsascript:: AV/Hashing/hashdelCON.js