Why This Matters
Distributed algorithms are the backbone of every system where multiple machines need to work together, from cloud databases to blockchain networks to apps syncing across your devices. The core challenges you need to understand are how nodes coordinate without a single point of control, how systems keep running when individual machines fail, and what mechanisms preserve consistency when machines can't agree on what time it is.
These concepts show up repeatedly on exams because they represent the fundamental problems of building reliable systems at scale. Don't just memorize algorithm names. Know what problem each algorithm solves and why that problem is hard in a distributed context. When you see Paxos, think "agreement despite failures." When you see Chandy-Lamport, think "capturing global state without stopping the world." That conceptual mapping is what separates strong exam answers from vocabulary recognition.
Achieving Agreement: Consensus and Coordination
When distributed nodes must agree on a single value or decision, you need consensus algorithms. The core challenge: nodes can fail, messages can be delayed or lost, and yet the system must still make progress and never contradict itself.
Consensus Algorithms (Paxos, Raft)
Consensus means all non-faulty nodes agree on the same value, even when some nodes crash or network partitions occur.
- Paxos provides the theoretical foundation. It defines three roles: proposers (suggest values), acceptors (vote on proposals), and learners (learn the decided value). A value is chosen once a majority of acceptors agree. Paxos is provably correct but notoriously difficult to implement because the original paper left many practical details unspecified.
- Raft was designed specifically for understandability. It breaks consensus into three subproblems: leader election, log replication, and safety. A single leader accepts client requests, appends them to its log, and replicates entries to followers. This makes reasoning about correctness much easier. Raft powers real systems like etcd and Consul.
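Raft's log-replication rule can be sketched in a few lines: an entry is committed once a majority of the cluster has stored it. This is a toy model, not a real Raft implementation; the class and method names are illustrative, and elections, terms, and retries are omitted.

```python
# Toy sketch of Raft's commit rule: an entry is committed once a
# majority of the cluster (leader included) has stored a copy.
# All names here are illustrative, not from any real Raft library.

class ToyRaftLeader:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.log = []            # replicated log entries
        self.commit_index = -1   # highest committed log index

    def append(self, entry):
        """Leader appends a client request to its own log."""
        self.log.append(entry)
        return len(self.log) - 1  # index of the new entry

    def on_ack(self, index, acks_from_followers):
        """Commit the entry once a majority of nodes store it."""
        # +1 counts the leader's own copy of the entry.
        if acks_from_followers + 1 > self.cluster_size // 2:
            self.commit_index = max(self.commit_index, index)

leader = ToyRaftLeader(cluster_size=5)
i = leader.append("set x=1")
leader.on_ack(i, acks_from_followers=1)   # 2 of 5 copies: not committed
assert leader.commit_index == -1
leader.on_ack(i, acks_from_followers=2)   # 3 of 5 copies: committed
assert leader.commit_index == 0
```

The majority requirement is what makes the rule safe: any two majorities of the same cluster overlap in at least one node, so a committed entry can never be silently forgotten by a later leader.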
Leader Election Algorithms
Leader election designates a single coordinator to make decisions, avoiding conflicts when multiple nodes might act simultaneously. Once a leader exists, other nodes follow rather than negotiating every decision, which dramatically reduces coordination overhead.
- The Bully algorithm selects the node with the highest ID. When a node detects the leader has failed, it sends election messages to all higher-ID nodes. If none respond, it declares itself leader. If a higher-ID node responds, that node takes over the election.
- The Ring algorithm passes election messages around a logical ring. Each node forwards the highest ID it has seen. When a message completes the ring, the node with the highest ID becomes leader.
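The Bully algorithm's core logic fits in a short sketch. Message passing is simulated with direct function calls here, and the node IDs are made up for illustration; a real implementation would use timeouts to detect non-responding nodes.

```python
# Minimal sketch of the Bully algorithm on a static node set.
# Messaging is simulated with recursion: a responding higher-ID node
# "takes over" by running its own election.

def bully_election(node_ids, alive, initiator):
    """Return the ID of the elected leader.

    node_ids: all known node IDs; alive: the set of responsive nodes;
    initiator: the node that detected the leader failure.
    """
    higher = [n for n in node_ids if n > initiator and n in alive]
    if not higher:
        return initiator  # no higher node answered: declare self leader
    # A responding higher node takes over; this repeats until the
    # highest alive node finds no one above it and declares itself.
    return bully_election(node_ids, alive, min(higher))

nodes = [1, 2, 3, 4, 5]
# Node 5 (old leader) crashed; node 1 notices and starts an election.
assert bully_election(nodes, alive={1, 2, 4}, initiator=1) == 4
```

The highest-ID live node always wins, which is exactly why the algorithm is called "bully."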
Byzantine Fault Tolerance Algorithms
Byzantine fault tolerance (BFT) handles nodes that lie, send conflicting messages, or behave unpredictably. This is a strictly harder problem than crash failures, because you can't simply assume a silent node has failed.
- PBFT (Practical Byzantine Fault Tolerance) requires 3f+1 total nodes to tolerate f Byzantine failures. It uses a three-phase protocol (pre-prepare, prepare, commit) where nodes exchange signed messages and proceed only when enough matching responses arrive.
- BFT is the foundation of blockchain consensus. Systems like Hyperledger Fabric use PBFT variants to ensure agreement even when some participants are malicious.
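The 3f+1 sizing rule is worth being able to reproduce on an exam. A short numeric sketch shows why the numbers work out: with quorums of 2f+1 out of 3f+1 nodes, any two quorums overlap in more than f nodes, so they always share at least one honest node.

```python
# PBFT sizing sketch: tolerating f Byzantine nodes requires n = 3f + 1
# total replicas, and each phase waits for 2f + 1 matching messages.

def pbft_sizes(f):
    n = 3 * f + 1        # minimum cluster size
    quorum = 2 * f + 1   # matching messages needed per phase
    return n, quorum

assert pbft_sizes(1) == (4, 3)    # 4 nodes tolerate 1 traitor
assert pbft_sizes(3) == (10, 7)

# Overlap check: two quorums of size 2f+1 drawn from n = 3f+1 nodes
# share at least (2f+1) + (2f+1) - (3f+1) = f + 1 nodes, i.e. more
# than f, so at least one shared node must be honest.
f = 3
assert 2 * (2 * f + 1) - (3 * f + 1) == f + 1
```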
Compare: Raft vs. PBFT. Both achieve consensus, but Raft assumes nodes fail only by crashing (crash fault tolerance), while PBFT assumes nodes might actively lie (Byzantine fault tolerance). For blockchain or adversarial environments, you need PBFT. For trusted internal systems, Raft is simpler and faster.
Managing Shared Resources: Mutual Exclusion and Synchronization
When multiple processes need exclusive access to a shared resource, you need mechanisms to prevent simultaneous access. The challenge: coordinating without a central lock server that could become a bottleneck or single point of failure.
Distributed Mutual Exclusion
Distributed mutual exclusion prevents race conditions by ensuring only one process enters a critical section at any time.
- The Ricart-Agrawala algorithm uses timestamped requests. A process broadcasts a request to all other processes and waits until every one of them replies with permission. If two processes request simultaneously, the one with the earlier timestamp wins (ties broken by process ID); the other defers its reply until it finishes its own critical section.
- The Token Ring approach circulates a single token around a logical ring. Only the token holder can access the resource. This is simple but vulnerable to token loss: if the token-holding node crashes, the system needs a recovery mechanism to regenerate the token.
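The heart of Ricart-Agrawala is the deferral decision, which can be sketched as a comparison of (timestamp, process ID) pairs. The function name and representation below are illustrative, assuming requests are ordered lexicographically so ties are broken deterministically.

```python
# Sketch of Ricart-Agrawala's deferral rule. Each request is a
# (Lamport timestamp, process ID) pair; tuple comparison gives a
# total order, so concurrent requests never deadlock on a tie.

def should_defer(my_request, incoming):
    """Return True if we defer our reply (our own request wins).

    my_request is None when we neither hold nor want the critical
    section, in which case we always reply immediately.
    """
    return my_request is not None and my_request < incoming

# Process 1 requested at time 5; process 2's later request is deferred.
assert should_defer((5, 1), (7, 2)) is True
# The earlier request is answered immediately.
assert should_defer((7, 2), (5, 1)) is False
# Equal timestamps: the lower process ID wins the tie.
assert should_defer((5, 1), (5, 2)) is True
# Not requesting at all: reply immediately.
assert should_defer(None, (5, 2)) is False
```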
Clock Synchronization Algorithms
Distributed nodes have no shared memory and no shared clock, so ordering events across nodes requires explicit synchronization.
- NTP (Network Time Protocol) synchronizes node clocks to external time sources, achieving millisecond-level accuracy by accounting for network round-trip delays.
- The Berkeley Algorithm takes a different approach: a coordinator polls all nodes, averages their clock values (discarding outliers), and sends correction offsets. This works well in LANs where no external time source is available.
- Lamport timestamps (logical clocks) provide event ordering without physical synchronization. Each process maintains a counter, incremented on every event and updated on message receipt. The guarantee: if event A causally precedes event B, then timestamp(A)<timestamp(B). Note that the converse isn't necessarily true: a lower timestamp doesn't guarantee causal precedence. Vector clocks extend this idea to capture the full causal relationship, using a vector of counters (one per process) so you can determine whether two events are causally related or concurrent.
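The Lamport clock rules above translate directly into code. This is a minimal sketch with illustrative names: increment on every local event (including sends), and on receipt take the maximum of the local and received timestamps, plus one.

```python
# Minimal Lamport logical clock, following the two rules:
# 1) increment on every local event (including sending a message);
# 2) on receipt, set time to max(local, received) + 1.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event; also called just before sending a message."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """Merge the sender's timestamp on message receipt."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
send_ts = a.tick()            # A sends at logical time 1
recv_ts = b.receive(send_ts)  # B receives: max(0, 1) + 1 = 2
assert send_ts < recv_ts      # causal order is reflected in timestamps
```

Note what this does not give you: two unrelated processes can independently reach the same timestamp, which is exactly why a lower timestamp doesn't imply causal precedence, and why vector clocks carry one counter per process instead.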
Compare: Ricart-Agrawala vs. Token Ring. Ricart-Agrawala requires 2(n−1) messages per critical section entry (n−1 requests plus n−1 replies). Token Ring has lower message complexity in the best case (just passing the token) but potentially higher latency, since a process must wait for the token to arrive even if no one else needs the resource. Choose based on whether message overhead or response time matters more for your scenario.
Capturing and Sharing State: Snapshots and Dissemination
Distributed systems need ways to understand global state and spread information efficiently. The challenge: there's no global clock or instantaneous communication, so you can't just "pause" everything to take a picture.
Distributed Snapshot Algorithms
The Chandy-Lamport algorithm captures a consistent global state without halting system execution. Here's how it works:
- An initiating process records its own local state and sends a special marker message on all its outgoing channels.
- When a process receives a marker on a channel for the first time, it records its own state (if it hasn't already) and begins recording all messages arriving on that channel.
- When a process receives a marker on a channel it's already recording, it stops recording that channel. The recorded messages represent the channel's state.
- Once all processes have recorded their state and all channels are accounted for, the collected data forms a consistent global snapshot.
This is essential for checkpointing (so you can recover from failures), debugging, and detecting stable properties like deadlock or termination.
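The marker-handling rules above can be sketched from a single process's point of view. This is a simplified model with illustrative names: channels are plain strings, and the "record my state" step just copies a dictionary.

```python
# Sketch of one process's marker handling in Chandy-Lamport.
# Channels and local state are simplified to plain Python values.

class SnapshotProcess:
    def __init__(self, incoming_channels, state):
        self.state = state
        self.recorded_state = None
        # channel -> list of in-flight messages recorded for the
        # snapshot; None means we are not recording that channel.
        self.recording = {ch: None for ch in incoming_channels}
        self.done_channels = set()

    def on_marker(self, channel):
        if self.recorded_state is None:
            # First marker seen: record local state and start
            # recording every OTHER incoming channel.
            self.recorded_state = self.state
            for ch in self.recording:
                if ch != channel:
                    self.recording[ch] = []
        # A marker closes its channel; messages recorded so far
        # on that channel are the channel's state.
        self.done_channels.add(channel)

    def on_message(self, channel, msg):
        if self.recording.get(channel) is not None and channel not in self.done_channels:
            self.recording[channel].append(msg)

p = SnapshotProcess(incoming_channels=["c1", "c2"], state={"x": 3})
p.on_marker("c1")                  # records state; starts recording c2
p.on_message("c2", "in-flight!")   # captured as part of c2's state
p.on_marker("c2")                  # marker closes c2
assert p.recorded_state == {"x": 3}
assert p.recording["c2"] == ["in-flight!"]
```

The key property: the message on c2 arrived after the process recorded its state but before the marker on c2, so it belongs to neither endpoint's local state and must be captured as channel state, which the sketch does.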
Gossip Protocols
Gossip protocols spread information epidemic-style: each node periodically selects a random peer and shares its state updates.
- Convergence takes roughly O(log n) rounds to reach all n nodes, making gossip highly scalable since no central coordinator is needed.
- Gossip tolerates node failures gracefully. If one node dies, information still spreads through other paths. This is why systems like Cassandra and Amazon's Dynamo rely on gossip for membership detection and state propagation.
- The tradeoff is that gossip provides eventual consistency, not immediate consistency. There's a window where different nodes may have different views of the system state.
Compare: Chandy-Lamport vs. Gossip. Chandy-Lamport captures a single consistent snapshot at one logical moment, while Gossip continuously propagates updates toward eventual consistency. Use snapshots for recovery points and property detection; use gossip for ongoing state synchronization and failure detection.
Organizing Data and Networks: Distributed Structures
Large-scale distributed systems need efficient ways to store data and route messages without centralized indexes. The challenge: maintaining structure and enabling fast lookups when nodes join, leave, and fail constantly.
Distributed Hash Tables (DHTs)
A DHT is a decentralized key-value store where each node is responsible for a portion of the keyspace, determined by consistent hashing.
- Chord organizes nodes in a logical ring. Each node maintains a finger table with pointers to nodes at exponentially increasing distances around the ring, enabling O(log n) lookup time. When a key is requested, the query is forwarded to successively closer nodes until the responsible node is reached.
- Kademlia uses an XOR-based distance metric for routing. The "distance" between two node IDs is their bitwise XOR, and each node maintains routing tables organized by XOR distance. Kademlia's symmetric distance property (the distance from A to B equals the distance from B to A) simplifies routing and improves efficiency.
- DHTs are the foundation of peer-to-peer systems like BitTorrent and IPFS, enabling scalable storage and retrieval without central servers.
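Kademlia's XOR metric is easy to demonstrate on small made-up IDs (real deployments use 160-bit IDs). The sketch below shows the symmetry property and how a key is assigned to the XOR-closest node.

```python
# Kademlia's XOR distance metric, shown on illustrative 8-bit IDs.

def xor_distance(a, b):
    return a ^ b

# Symmetry: d(A, B) == d(B, A), unlike "clockwise distance" on a
# Chord-style ring, where distance depends on direction.
assert xor_distance(0b1010, 0b0110) == xor_distance(0b0110, 0b1010) == 0b1100

# A key is stored on the node whose ID is XOR-closest to the key.
nodes = [0b0001, 0b0100, 0b1011, 0b1110]
key = 0b1001
closest = min(nodes, key=lambda n: xor_distance(n, key))
assert closest == 0b1011   # 1011 ^ 1001 = 0010, the smallest distance
```

Because XOR is a true metric (symmetric, with unique distances from any point), each node can learn useful routing information from every message it receives, in either direction.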
Distributed Spanning Tree Algorithms
A spanning tree creates a loop-free topology for efficient broadcast and routing across network nodes.
- The Spanning Tree Protocol (STP) elects a root bridge and prunes redundant links to prevent broadcast storms. Every node can still be reached, but packets follow a single tree path rather than looping endlessly.
- Spanning trees enable efficient resource allocation: messages follow tree edges rather than flooding the entire network, reducing bandwidth usage significantly.
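The effect a spanning tree achieves can be sketched with a simple BFS from the elected root: keep the first link that reaches each node and prune the rest. This is a centralized illustration of the resulting topology, not the distributed STP message exchange itself.

```python
# Sketch: building a loop-free broadcast tree from an elected root
# via BFS, keeping one tree edge per node and ignoring redundant
# links (the effect STP achieves by blocking ports).
from collections import deque

def spanning_tree(adjacency, root):
    """Return {node: parent} tree edges; the root maps to None."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in parent:   # first path found wins
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

# A square of links: the B-D and C-D redundancy would loop broadcasts,
# but the tree keeps exactly one path to every node.
net = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
tree = spanning_tree(net, root="A")
assert len(tree) == 4 and tree["A"] is None
# n - 1 tree edges: every node reachable, no cycles possible.
assert sum(1 for p in tree.values() if p is not None) == 3
```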
Distributed Graph Algorithms
These algorithms process graph structures that are partitioned across multiple nodes, where no single machine holds the entire graph.
- PageRank iteratively computes node importance through message passing. Each node sends a fraction of its current rank to its neighbors, and ranks are updated until convergence. This is inherently parallelizable since each node's computation depends only on messages from its neighbors.
- Distributed MST algorithms like GHS (Gallager-Humblet-Spira) build minimum spanning trees in O(E + n log n) messages by having fragments of the tree merge along their minimum-weight outgoing edges.
- These algorithms are critical for social network analysis, web search ranking, and network optimization at scales where centralized processing is impossible.
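The message-passing structure of PageRank can be sketched on a tiny graph. This is a single-machine illustration of the per-node computation; in a distributed setting, the "messages" would travel between partitions. The damping factor 0.85 is the conventional choice, and the graph here has no dangling nodes, which keeps the sketch simple.

```python
# Message-passing PageRank sketch: each iteration, every node sends
# rank/out_degree to each of its neighbors, then updates its own rank
# from the messages it received. Damping factor d = 0.85.

def pagerank(out_links, iterations=50, d=0.85):
    n = len(out_links)
    rank = {node: 1.0 / n for node in out_links}
    for _ in range(iterations):
        incoming = {node: 0.0 for node in out_links}
        for node, targets in out_links.items():
            share = rank[node] / len(targets)  # message to each neighbor
            for t in targets:
                incoming[t] += share
        rank = {node: (1 - d) / n + d * incoming[node] for node in out_links}
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r = pagerank(graph)
# C receives rank from both A and B, so it ends up most important.
assert r["C"] > r["A"] > r["B"]
# With no dangling nodes, total rank is conserved at 1.
assert abs(sum(r.values()) - 1.0) < 1e-6
```

Notice why this parallelizes so well: each node's update reads only the messages addressed to it, so partitions can compute independently and exchange messages once per iteration.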
Compare: DHTs vs. Spanning Trees. DHTs optimize for data lookup and storage distribution, while spanning trees optimize for message routing and broadcast efficiency. A system might use both: a DHT for finding where data lives, and a spanning tree for efficiently reaching all nodes with a broadcast.
Quick Reference Table
| Problem | Key Algorithms |
| --- | --- |
| Agreement despite crashes | Paxos, Raft, Leader Election |
| Agreement despite malicious nodes | PBFT, Byzantine fault tolerance |
| Exclusive resource access | Ricart-Agrawala, Token Ring |
| Time and ordering | NTP, Berkeley Algorithm, Lamport timestamps, Vector clocks |
| Global state capture | Chandy-Lamport snapshots |
| Scalable information spread | Gossip protocols |
| Decentralized data storage | Chord, Kademlia (DHTs) |
| Network topology management | Spanning Tree Protocol, distributed MST (GHS) |
Self-Check Questions
- Both Raft and PBFT achieve consensus. What fundamental assumption about node failures distinguishes when you'd use each one?
- If a system needs to detect whether a distributed computation has terminated, which algorithm family would you use to capture the necessary global state, and why can't you simply query each node individually?
- Compare Ricart-Agrawala and Token Ring for distributed mutual exclusion: which has better message complexity, and which is more vulnerable to a single point of failure?
- A peer-to-peer file sharing system needs to locate files across millions of nodes without any central server. Which distributed structure enables O(log n) lookups, and what mechanism determines which node stores which keys?
- A large-scale database needs nodes to share membership information and detect failures, but centralized coordination would create a bottleneck. Which protocol family achieves eventual consistency through randomized peer-to-peer communication, and what is its typical convergence time in terms of network size?
- Lamport timestamps guarantee that if event A causally precedes event B, then timestamp(A) < timestamp(B). Why is the converse not true, and what extension to logical clocks solves this limitation?