
๐Ÿ”Data Structures Unit 9 Review


9.3 Hash Table Implementation and Analysis

Written by the Fiveable Content Team • Last updated August 2025

Hash Table Implementation

Hash tables map keys to array indices using hash functions, giving you average-case O(1) lookups, insertions, and deletions. That speed advantage makes them one of the most widely used data structures in practice. The tradeoff: you need to handle collisions (when two keys land on the same index) and manage the table's size as it fills up.

Hash table implementation with arrays

A hash table is backed by an array. A hash function takes a key and returns an index into that array. The quality of the hash function directly determines how evenly keys spread across the table.

Common hash functions:

  • Modulo hash function: hash(key) = key mod m, where m is the table size. Simple and fast. Choosing a prime number for m helps distribute keys more evenly.
  • Multiplication hash function: hash(key) = ⌊m · (key · A mod 1)⌋, where A is a constant between 0 and 1. Knuth suggests A ≈ 0.6180339887 (the golden ratio conjugate). This method is less sensitive to the choice of m.
  • Universal hashing: Randomly select a hash function from a designed family of functions at runtime. This gives provably low expected collisions regardless of the input distribution.
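
The first two fixed hash functions can be sketched in a few lines. This is a minimal illustration (the function names are my own), assuming integer keys:

```python
def modulo_hash(key: int, m: int) -> int:
    # Modulo method: works best when m is prime.
    return key % m

def multiplication_hash(key: int, m: int, A: float = 0.6180339887) -> int:
    # Knuth's multiplicative method: take the fractional part of key * A,
    # then scale it up to a table index. Less sensitive to the choice of m.
    frac = (key * A) % 1
    return int(m * frac)
```

Either function maps an arbitrary integer key into the range 0 to m-1; the difference is how sensitive the distribution is to your choice of m.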

Collision resolution is needed because two different keys can hash to the same index. There are two main strategies:

Separate chaining stores a linked list (or other collection) at each index. When a collision occurs, the new key is appended to the list at that index. Lookups walk the list to find the target key. The lists grow dynamically, so the table never technically "fills up," but long chains hurt performance.
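
A minimal separate-chaining table might look like the sketch below. Python lists stand in for the per-slot linked lists, and the class name and sizes are illustrative, not a standard API:

```python
class ChainedHashTable:
    def __init__(self, m: int = 8):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # one chain per slot

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % self.m]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision: append to the chain

    def search(self, key):
        bucket = self.buckets[hash(key) % self.m]
        for k, v in bucket:              # walk the chain at this index
            if k == key:
                return v
        return None
```

Note that insert and search both cost time proportional to the chain length, which is why long chains hurt performance.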

Open addressing stores all entries directly in the array. When a collision occurs, you probe for the next available slot:

  1. Linear probing: Check index h + 1, then h + 2, and so on (wrapping around). Simple, but tends to create clusters of occupied slots, which slows future insertions and lookups.
  2. Quadratic probing: Check index h + 1², then h + 2², then h + 3², etc. Reduces clustering compared to linear probing, but can fail to find an empty slot if the table is more than half full (depending on table size).
  3. Double hashing: Use a second hash function h₂(key) to determine the step size. Probe sequence: h + i·h₂(key) for i = 0, 1, 2, … This produces more uniform probe sequences and avoids clustering.
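
As a sketch of open addressing, here is linear probing (strategy 1 above). Deletion is omitted because it requires tombstone markers, which this sketch skips; the class name is illustrative:

```python
class LinearProbingTable:
    def __init__(self, m: int = 8):
        self.m = m
        self.slots = [None] * m          # each slot holds (key, value) or None

    def insert(self, key, value):
        h = hash(key) % self.m
        for i in range(self.m):
            j = (h + i) % self.m         # probe h, h+1, h+2, ... wrapping around
            if self.slots[j] is None or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                return
        raise RuntimeError("table full") # open addressing cannot store more than m entries

    def search(self, key):
        h = hash(key) % self.m
        for i in range(self.m):
            j = (h + i) % self.m
            if self.slots[j] is None:    # an empty slot ends the probe sequence
                return None
            if self.slots[j][0] == key:
                return self.slots[j][1]
        return None
```

Switching to quadratic probing or double hashing only changes the line that computes j; the surrounding logic stays the same.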

Time complexity of hash operations

Average case: O(1) for insertion, deletion, and search. This assumes a good hash function that distributes keys uniformly and a load factor that stays reasonable (typically α < 0.75).

Worst case: O(n) for all three operations. This happens when every key hashes to the same index. With chaining, you'd be searching a single linked list of length n. With open addressing, you'd probe through nearly every slot.

The load factor α controls how full the table is:

α = n / m

where n is the number of stored keys and m is the table size. As α increases, collisions become more frequent. For separate chaining, α can exceed 1 (more keys than slots), but performance degrades. For open addressing, α must stay below 1, and performance drops sharply as it approaches 1. A common threshold is α = 0.75 for triggering a resize.
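
The resize check is a one-liner once you have n and m. A small sketch (the function name and default threshold are illustrative):

```python
def needs_resize(n: int, m: int, threshold: float = 0.75) -> bool:
    # Resize when the load factor n/m reaches the threshold.
    return n / m >= threshold
```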


Hash Table Optimization and Applications

Rehashing and dynamic resizing

When the load factor crosses a threshold, the table needs to grow. This process is called rehashing:

  1. Allocate a new array, typically double the current size.
  2. Recompute the hash for every existing key using the new table size (since m changed, all indices change).
  3. Insert each key into the new array.
  4. Deallocate the old array.

Rehashing is O(n) since you touch every key. But it happens infrequently enough that the amortized cost of insertion remains O(1). Think of it this way: you do n cheap insertions, then one expensive O(n) rehash, which averages out to O(1) per insertion.

Some implementations also shrink the table (e.g., halve it when α drops below 0.25) to avoid wasting memory after many deletions.
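
The four rehashing steps can be sketched for an open-addressing table stored as a list of (key, value) slots. This is an illustration, not a library API; shrinking would follow the same pattern with a smaller new_m:

```python
def rehash(old_slots: list, new_m: int) -> list:
    new_slots = [None] * new_m            # step 1: allocate the new array
    for entry in old_slots:
        if entry is None:
            continue
        key, value = entry
        j = hash(key) % new_m             # step 2: recompute with the new table size
        while new_slots[j] is not None:   # resolve collisions by linear probing
            j = (j + 1) % new_m
        new_slots[j] = (key, value)       # step 3: insert into the new array
    return new_slots                      # step 4: the old array becomes garbage
```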


Hash tables vs other data structures

| Operation | Hash Table | Array | Balanced BST (AVL, Red-Black) |
|---|---|---|---|
| Search | O(1) avg, O(n) worst | O(n) (unsorted), O(log n) (sorted) | O(log n) worst |
| Insert | O(1) avg, O(n) worst | O(n) (shifting) | O(log n) worst |
| Delete | O(1) avg, O(n) worst | O(n) (shifting) | O(log n) worst |
| Ordered iteration | Not supported | By index only | Yes (in-order traversal) |
| Range queries | Not efficient | Not efficient | Efficient |

The key takeaway: hash tables win on raw speed for individual key lookups. Balanced BSTs win when you need sorted order or range queries (e.g., "find all keys between 50 and 100"). Arrays are best when you're accessing elements by numeric index.

Applications of hash tables

Caching: Store recently or frequently accessed data in a hash table keyed by the request (URL, query, etc.). Instead of hitting a database or making a network call, you check the cache first. If the key exists, you return the cached result in O(1).

Indexing: Build a lookup structure that maps search terms to their locations. For example, an inverted index maps each word in a document collection to the list of documents containing that word. Search engines rely heavily on this pattern.

Counting frequencies: Use a hash table where keys are elements and values are counts. To count word frequencies in a text:

  1. For each word, check if it exists in the table.
  2. If yes, increment its count.
  3. If no, insert it with a count of 1.

This runs in O(n) for n words, since each lookup and update is O(1) on average. The same pattern applies to detecting duplicates, computing histograms, and grouping data by category.
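
The three-step pattern above maps directly to code. A minimal sketch using a plain dict (Python's built-in hash table):

```python
def word_counts(text: str) -> dict:
    counts = {}
    for word in text.split():
        if word in counts:       # steps 1-2: key exists, increment its count
            counts[word] += 1
        else:                    # step 3: first occurrence, insert with count 1
            counts[word] = 1
    return counts
```

Each iteration does one O(1) average-case lookup and one O(1) update, giving the O(n) total above.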