
9.1 Hash Function Design and Properties


Written by the Fiveable Content Team • Last updated August 2025
๐Ÿ”Data Structures
Unit & Topic Study Guides

Hash Function Fundamentals

Hash functions map arbitrary-sized data to fixed-size values, enabling the constant-time lookups that make hash tables so fast. Understanding how to design a good hash function is the foundation for everything else in this unit, because a poorly chosen hash function can degrade a hash table from O(1) average-case operations all the way down to O(n).

Purpose of Hash Functions

A hash function takes a key of any size and produces a fixed-size integer called a hash code. That hash code determines where the key-value pair gets stored in the table's underlying array.

This is what makes hash tables useful for structures like dictionaries and sets: instead of searching through every element, you compute the hash code, jump straight to the right slot, and get average-case O(1) for insertion, deletion, and lookup.

Beyond hash tables, hash functions show up in database indexing, caching layers, checksums for data integrity, and cryptographic applications. The design goals differ across these contexts, but the core idea is the same: turn data into a compact, fixed-size representation.


Uniformity and Efficiency

Two properties matter most when evaluating a hash function: uniformity and efficiency.

Uniformity means the hash function distributes keys evenly across the output range. When distribution is uneven, multiple keys land in the same slot, causing collisions. A collision occurs when two different keys produce the same hash code. More collisions mean more time spent resolving them, which erodes the constant-time guarantee.

You can assess uniformity using statistical tools like the chi-squared test (comparing observed vs. expected slot occupancy) or by monitoring the load factor (number of stored elements divided by table size).
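To make the chi-squared idea concrete, here is a minimal sketch in Python. The function name, the key set, and the choice of Python's built-in `hash` are all illustrative assumptions, not a fixed API: the point is just to compare observed slot occupancy against the uniform expectation.

```python
from collections import Counter

def chi_squared_uniformity(keys, hash_fn, m):
    """Chi-squared statistic for slot occupancy under hash_fn with m slots.

    Values far above m suggest the hash function is clustering keys;
    values near zero indicate a near-uniform spread. (Illustrative
    sketch; hash_fn and the key set are assumptions.)
    """
    counts = Counter(hash_fn(k) % m for k in keys)
    expected = len(keys) / m  # uniform expectation per slot
    return sum((counts.get(slot, 0) - expected) ** 2 / expected
               for slot in range(m))

# Example: 1000 sequential integer keys spread over 31 slots.
stat = chi_squared_uniformity(range(1000), hash, 31)
load_factor = 1000 / 31  # stored elements / table size
```

Sequential keys modulo a prime spread almost perfectly, so the statistic comes out tiny; a pathological key set (say, all multiples of 31) would drive it up sharply.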

Efficiency means the hash code is fast to compute. If computing the hash takes longer than the lookup it enables, you've defeated the purpose. The ideal is O(1) time and minimal extra space per computation.

Here are three common hash function families and how they balance these properties:

  • Division method: h(k) = k mod m. Simple and fast, but if m has small prime factors or is a power of 2, keys with patterns in their lower bits will cluster into the same slots. Choosing m as a prime not close to a power of 2 helps.
  • Multiplication method: h(k) = ⌊m(kA mod 1)⌋, where A is a constant between 0 and 1. This gives better distribution than division and is less sensitive to the choice of m. Knuth suggested A ≈ 0.6180339887 (the golden ratio conjugate) as a good default.
  • Universal hashing: Instead of picking one fixed function, you randomly select from a family of hash functions at runtime. This provides theoretical worst-case guarantees on collision probability, regardless of the input distribution. The trade-off is slightly more setup overhead.
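The division and multiplication methods above can be sketched in a few lines of Python. The table size and function names here are illustrative choices, not prescribed values:

```python
import math

M = 13  # table size: a prime not close to a power of 2, per the division-method guideline
A = (math.sqrt(5) - 1) / 2  # Knuth's suggestion: golden ratio conjugate ≈ 0.618

def division_hash(k: int, m: int = M) -> int:
    """Division method: h(k) = k mod m."""
    return k % m

def multiplication_hash(k: int, m: int = M, a: float = A) -> int:
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    return int(m * ((k * a) % 1.0))
```

Both return a slot index in [0, m). Note how the multiplication method uses only the fractional part of k·A, which is why the exact value of m matters much less than in the division method.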

Designing and Optimizing Hash Functions

Designing for Different Data Types

Different data types need different hashing strategies.

Integers are the simplest case. Modulo-based methods like h(k) = k mod m work directly. You can also use bit-level operations (XOR, shifts, rotations) to scramble the bits and break up patterns in the input. For example, many practical hash functions XOR a key with a right-shifted version of itself before taking the modulus, which helps spread out clustered values.
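A minimal sketch of that XOR-with-shifted-self idea, assuming 32-bit keys; the specific shift amounts are illustrative, not canonical:

```python
def mix_hash(k: int, m: int) -> int:
    """Scramble an integer key's bits before taking the modulus.

    XORing the key with right-shifted copies of itself folds the
    high bits into the low bits, so keys that differ only in their
    upper bits stop colliding under the final modulus.
    """
    k &= 0xFFFFFFFF   # treat the key as a 32-bit value
    k ^= k >> 16      # fold the top half into the bottom half
    k ^= k >> 8       # one more round of mixing
    return k % m
```

With a plain modulus and a power-of-2 table, keys like 0x10000 and 0x20000 both land in slot 0; after mixing they separate.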

Strings require more thought because they're variable-length sequences of characters. Two common approaches:

  • Polynomial rolling hash: Treat each character as a coefficient of a polynomial. For a string s_0 s_1 … s_(n−1), compute h = s_0·p^(n−1) + s_1·p^(n−2) + … + s_(n−1), where p is a small prime (like 31 or 37). This is fast to compute incrementally and produces good distribution.
  • CRC (Cyclic Redundancy Check): Computes the remainder of a polynomial division over the bit representation. More commonly used for error detection, but sometimes applied in hash table contexts.
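The polynomial rolling hash is usually evaluated incrementally via Horner's rule, as in this sketch (the modulus choice is an illustrative assumption to keep the value bounded):

```python
def poly_hash(s: str, p: int = 31, m: int = 2**31 - 1) -> int:
    """Polynomial rolling hash of a string.

    Computes h = s_0*p^(n-1) + s_1*p^(n-2) + ... + s_(n-1) mod m
    one character at a time via Horner's rule, so each character
    costs one multiply and one add.
    """
    h = 0
    for ch in s:
        h = (h * p + ord(ch)) % m
    return h
```

Because position weights each character differently, anagrams like "ab" and "ba" hash to different values, which a naive sum of character codes would not achieve.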

Composite objects (like a struct with multiple fields) are hashed by combining the hash codes of their individual fields. A typical approach: multiply a running hash by a prime number, then XOR or add the next field's hash code. This prevents different field orderings from producing the same result.

When choosing a hash function, consider:

  • The expected key distribution. If keys are nearly sequential integers, a simple modulus on a prime works fine. If keys are skewed or patterned, you need a function that actively scrambles those patterns.
  • Your collision resolution scheme. Chaining is more tolerant of collisions than open addressing, so open addressing benefits more from a higher-quality hash function.
  • Security requirements. For untrusted input (like user-supplied keys in a web server), use cryptographic hash functions like SHA-256 to prevent adversarial collision attacks. These are much slower but resistant to deliberate exploitation.

Complexity vs. Performance Trade-offs

There's a real tension between hash function quality and speed.

A simple function like h(k) = k mod m computes in nanoseconds but may produce uneven distributions for certain inputs. A more sophisticated function (multiple rounds of bit mixing, or universal hashing) distributes keys more evenly but takes longer per call. Since the hash function runs on every insertion and lookup, even small per-call costs add up.

Your collision resolution strategy also factors into this decision:

  1. Chaining stores colliding elements in linked lists (or other structures) at each slot. It tolerates higher collision rates gracefully, so you can afford a simpler, faster hash function. The trade-off is extra memory for the list pointers.
  2. Open addressing probes alternative slots within the array when a collision occurs. Clusters of occupied slots slow down probing, so you need a hash function with strong uniformity to avoid clustering. Linear probing is especially sensitive to this.
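The two strategies above can be contrasted with minimal insertion sketches. These class and method names are hypothetical, and both omit resizing, deletion, and lookup to keep the collision-handling difference visible:

```python
class ChainedTable:
    """Chaining: each slot holds a list of (key, value) pairs."""
    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def put(self, key, value):
        bucket = self.slots[hash(key) % self.m]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update existing key in place
                return
        bucket.append((key, value))       # colliding keys share the list

class LinearProbingTable:
    """Open addressing: on collision, scan forward for a free slot."""
    def __init__(self, m=8):
        self.m = m
        self.keys = [None] * m
        self.vals = [None] * m

    def put(self, key, value):
        i = hash(key) % self.m
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.m  # probe next slot (assumes table never fills)
        self.keys[i], self.vals[i] = key, value
```

Inserting two keys that hash to the same slot shows the difference: the chained table grows one bucket's list, while the probing table spills the second key into the neighboring slot, forming exactly the kind of cluster that makes linear probing sensitive to hash quality.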

To find the right balance for a specific application, consider:

  • The expected number of elements and the table's load factor (keeping it below 0.7 is a common guideline for open addressing)
  • Whether keys come from a known, predictable distribution or from arbitrary/adversarial sources
  • Actual benchmarking. Profile different hash functions with realistic data. Theoretical uniformity doesn't always predict real-world performance, because cache behavior and branch prediction also matter.