Parallel and Distributed Computing

Fault Tolerance Techniques

Why This Matters

In parallel and distributed computing, failure isn't just a possibility; it's an inevitability. When you're coordinating hundreds or thousands of nodes, hardware fails, networks partition, and processes crash. Understanding how systems maintain correctness and availability despite these failures is central to this topic. You need to grasp the fundamental tradeoffs between consistency, availability, performance, and resource overhead that every distributed system must navigate.

These techniques represent different strategies for the same core problem: keeping systems running when things go wrong. Some approaches prevent failures from causing problems (redundancy), others detect and recover from failures (checkpointing), and still others ensure agreement even when nodes misbehave (Byzantine fault tolerance). Don't just memorize what each technique does. Understand which failure scenarios each technique addresses and why you'd choose one over another.


Redundancy-Based Approaches

These techniques prevent failures from causing system-wide problems by maintaining multiple copies of data, processes, or components. The core principle: if one copy fails, others remain available to serve requests.

Replication

  • Creates multiple copies of data or processes across different nodes, so the system survives individual node failures.
  • Synchronous replication guarantees all copies update together (strong consistency), while asynchronous replication allows delayed updates (better performance, weaker consistency). The choice between them is one of the most common design tradeoffs you'll encounter; the sketch after this list makes it concrete.
  • Enables automatic failover: when one replica fails, the system redirects to a healthy copy without user-visible interruption.
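To make the tradeoff concrete, here is a minimal Python sketch of a primary replicating writes to followers. The `Primary` class and the `apply` method on replicas are invented for illustration; a real system would add acknowledgments, retries, and failover logic.

```python
import queue
import threading

class Primary:
    """Toy primary that forwards each write to a set of replica objects."""

    def __init__(self, replicas, synchronous=True):
        self.replicas = replicas          # hypothetical objects with .apply(update)
        self.synchronous = synchronous
        self._backlog = queue.Queue()
        if not synchronous:
            # Background thread ships queued updates to replicas eventually.
            threading.Thread(target=self._drain, daemon=True).start()

    def write(self, update):
        if self.synchronous:
            # Strong consistency: don't return until every replica has applied.
            for replica in self.replicas:
                replica.apply(update)
        else:
            # Weak consistency: return immediately, replicate in the background.
            self._backlog.put(update)

    def _drain(self):
        while True:
            update = self._backlog.get()
            for replica in self.replicas:
                replica.apply(update)
```

The synchronous path pays replica latency on every write but never acknowledges data a replica hasn't seen; the asynchronous path acknowledges instantly but can lose acknowledged writes if the primary crashes before the backlog drains.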

Redundancy (Hardware and Software)

  • Duplicates critical components at multiple levels, from disk drives (RAID) to entire data centers, providing alternatives during failures.
  • Hardware redundancy includes redundant power supplies, network paths, and storage. Software redundancy includes backup services and standby processes ready to activate when a primary fails.
  • Increases both reliability and availability but comes with cost and complexity tradeoffs. More redundancy means more money and more state to keep synchronized.

Process Pairs

  • Runs two identical processes on separate nodes that continuously monitor each other's execution.
  • Provides immediate failover: if the primary process fails, the backup takes over with minimal interruption. This is different from replication because the backup is actively tracking the primary's state, not just holding a copy of data (see the sketch after this list).
  • Enables error detection through comparison: discrepancies between outputs can reveal corruption or bugs before they cause larger problems.
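A minimal sketch of the backup half of a process pair, assuming heartbeats arrive as method calls (in practice the two processes run on separate nodes and exchange messages):

```python
import time

class BackupProcess:
    """Toy backup: mirrors the primary's state and takes over on silence."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout            # assumed heartbeat deadline
        self.mirrored_state = None
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_heartbeat(self, primary_state):
        # The primary ships its state with every heartbeat, keeping the backup hot.
        self.mirrored_state = primary_state
        self.last_heartbeat = time.monotonic()

    def monitor(self):
        # Called periodically; promote the backup if the primary goes silent.
        if not self.active and time.monotonic() - self.last_heartbeat > self.timeout:
            self.active = True            # resume from the last mirrored state
        return self.active
```

Because the backup already holds the primary's latest state, promotion is nearly instantaneous; there is no log to replay or checkpoint to restore first.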

Compare: Replication vs. Process Pairs. Both maintain redundant copies, but replication focuses on data availability while process pairs focus on computation continuity. Process pairs also enable error detection through output comparison, which simple replication doesn't provide.


State Recovery Techniques

When failures do occur, these techniques enable systems to recover to a known good state rather than starting from scratch. The core principle: periodically save progress so you don't lose everything when something crashes.

Checkpointing and Rollback Recovery

Checkpointing works like saving your progress in a video game. Without it, a crash means restarting from the very beginning.

  • Periodically saves process state to stable storage, creating restore points the system can return to after failures.
  • Coordinated checkpointing synchronizes all processes so they checkpoint at the same logical time. This makes recovery simpler because all saved states are mutually consistent, but it forces every process to pause together, adding overhead.
  • Uncoordinated checkpointing lets processes checkpoint independently. This has lower runtime overhead, but recovery is more complex because you may need to roll back multiple processes to find a set of checkpoints that are mutually consistent (avoiding the domino effect, where rolling back one process forces another to roll back, which forces another, and so on).
  • The key tradeoff is checkpoint frequency vs. work lost: checkpoint too often and you waste time saving state; checkpoint too rarely and you lose more work per failure.
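A minimal sketch of the save/restore loop, using a hypothetical checkpoint file name and Python's pickle for serialization:

```python
import os
import pickle

CHECKPOINT = "state.ckpt"                 # hypothetical path for this sketch

def save_checkpoint(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)           # atomic rename: no half-written file

def restore_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)         # resume from the last restore point
    return {"step": 0, "total": 0}        # no checkpoint: start from scratch

state = restore_checkpoint()
for step in range(state["step"], 10_000):
    state["total"] += step                # the "real work"
    state["step"] = step + 1
    if step % 1_000 == 0:                 # the frequency knob from the last bullet
        save_checkpoint(state)
```

Raising the `1_000` interval lowers checkpointing overhead but means a crash loses more completed steps; lowering it does the reverse. That is exactly the frequency-vs-work-lost tradeoff.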

Error Correction Codes

  • Adds redundant bits to data that allow detection and correction of transmission or storage errors without retransmission.
  • Hamming codes correct single-bit errors (sketched after this list). Reed-Solomon codes handle burst errors (multiple consecutive corrupted bits) in storage systems and communications.
  • Enables transparent recovery from common hardware-level errors. Your data arrives correct even when the underlying medium introduces noise. This happens constantly in RAM (ECC memory) and network communication without you ever noticing.
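A minimal sketch of a Hamming(7,4) code, which protects 4 data bits with 3 parity bits and corrects any single flipped bit:

```python
def hamming74_encode(d1, d2, d3, d4):
    # Parity bits sit at positions 1, 2, and 4 of the 7-bit codeword.
    p1 = d1 ^ d2 ^ d4                     # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                     # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4                     # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    # Recomputing each parity check yields a "syndrome" naming the bad position.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3      # 0 means no error detected
    if error_pos:
        c[error_pos - 1] ^= 1             # flip the corrupted bit back
    return c[2], c[4], c[5], c[6]         # extract the data bits

codeword = hamming74_encode(1, 0, 1, 1)
codeword[4] ^= 1                          # simulate a single-bit error in transit
assert hamming74_decode(codeword) == (1, 0, 1, 1)   # recovered without retransmit
```

The receiver never asks for a retransmission: the redundant bits alone pinpoint and repair the flipped bit, which is why the same idea works inside ECC RAM, where retransmission isn't even possible.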

Compare: Checkpointing vs. Error Correction Codes. Checkpointing recovers from process-level failures by restoring saved state, while error correction codes handle data-level corruption by reconstructing original bits. Both reduce work lost to failures but operate at very different abstraction levels.


Consensus and Agreement Protocols

In distributed systems, nodes must agree on shared state even when some nodes fail or messages get lost. The core principle: achieve agreement among functioning nodes despite partial failures, so all nodes see a consistent view.

Consensus Algorithms (Paxos, Raft)

  • Enable distributed nodes to agree on a single value even when some nodes crash or messages are delayed. This is essential for leader election, configuration changes, and replicated state machines.
  • Raft was designed to be understandable. It breaks consensus into clear sub-problems: leader election, log replication, and safety. Paxos is more general but notoriously difficult to implement correctly. In practice, many real systems use Raft or Raft-inspired protocols.
  • Both guarantee safety (nodes never disagree on a committed value) and liveness (the system eventually makes progress), provided a majority of nodes are functioning. A cluster of 5 nodes can tolerate 2 crashes and keep working.
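The majority arithmetic in the last bullet is worth internalizing. Here is a minimal sketch of the quorum rule behind Raft-style commitment (heavily simplified; real Raft also tracks terms and log indices):

```python
def majority(cluster_size):
    # Smallest group that is strictly more than half the cluster.
    return cluster_size // 2 + 1

def can_commit(cluster_size, nodes_with_entry):
    # A log entry is committed once a majority has durably stored it.
    return nodes_with_entry >= majority(cluster_size)

assert majority(5) == 3
assert can_commit(5, 3)        # leader + 2 followers suffice in a 5-node cluster
assert not can_commit(5, 2)    # with 3 nodes down, the cluster stalls (safely)
```

Any two majorities of the same cluster must overlap in at least one node, which is why a committed value can never be silently contradicted by a later leader.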

Byzantine Fault Tolerance

  • Handles arbitrary failures including nodes that lie, send conflicting messages, or act maliciously. This goes far beyond crash failures, where a node simply stops responding.
  • Requires at least 3f + 1 nodes to tolerate f Byzantine faults. So tolerating just 1 malicious node requires at least 4 nodes; tolerating 2 requires 7. The system needs enough honest nodes to outvote malicious ones (see the arithmetic sketch after this list).
  • Essential for trustless environments like blockchain systems where you can't assume all participants are honest. The tradeoff is significant performance overhead: BFT protocols require many more message exchanges per decision than crash-fault protocols.
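A quick arithmetic sketch of the node-count bounds (these are the standard formulas used by BFT protocols such as PBFT):

```python
def bft_min_nodes(f):
    return 3 * f + 1          # minimum cluster size to tolerate f Byzantine nodes

def bft_quorum(f):
    return 2 * f + 1          # any two quorums overlap in at least f + 1 nodes

for f in (1, 2, 3):
    print(f"f={f}: need {bft_min_nodes(f)} nodes, quorum size {bft_quorum(f)}")
# f=1: need 4 nodes, quorum size 3
# f=2: need 7 nodes, quorum size 5
# f=3: need 10 nodes, quorum size 7
```

Two quorums of size 2f + 1 drawn from 3f + 1 nodes share at least f + 1 members, and at most f of those can be malicious, so every pair of quorums agrees through at least one honest node.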

Compare: Consensus Algorithms vs. Byzantine Fault Tolerance. Standard consensus (Paxos, Raft) assumes nodes either work correctly or crash (crash-fault tolerance), while BFT handles nodes that actively misbehave. BFT requires more nodes (3f + 1 vs. 2f + 1) and more message rounds, making it slower but necessary when you can't trust all participants.


Diversity and Voting Approaches

These techniques use independent implementations or judgments to catch errors that would slip past a single version. The core principle: if multiple independent sources agree, the result is probably correct.

N-Version Programming

  • Develops multiple independent implementations of the same specification, reducing the chance that all versions share the same bug. For example, three separate teams might each write their own flight control system from the same requirements document.
  • Compares outputs and uses majority voting to select the correct result. If two of three versions agree, that answer is used (see the sketch after this list).
  • Addresses software faults that replication can't handle. Running the same buggy code on multiple nodes just replicates the bug on every node. N-version programming uses different code, so a bug in one version is unlikely to appear in the others.
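A minimal sketch of the voting harness, using three deliberately different sort implementations as stand-ins for independently developed versions (real N-version programming uses separate teams, not one author writing three functions):

```python
from collections import Counter
import heapq

def insertion_sort(xs):                    # "version 2", written independently
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def heap_sort(xs):                         # "version 3", a different algorithm
    heap = list(xs)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

def n_version_sort(xs):
    # Run every version, then accept the answer a majority agrees on.
    results = [tuple(sorted(xs)), tuple(insertion_sort(xs)), tuple(heap_sort(xs))]
    answer, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority: versions disagree")
    return list(answer)

assert n_version_sort([3, 1, 2]) == [1, 2, 3]
```

If one version has an off-by-one bug on some input, the other two outvote it; running the buggy version on three machines (plain replication) would instead produce the same wrong answer three times.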

Compare: N-Version Programming vs. Replication. Replication protects against hardware failures by running the same code on multiple machines, while N-version programming protects against software bugs by running different code that should produce the same results. Use replication for availability; use N-version for correctness.


Detection and Prevention Mechanisms

Before you can recover from failures, you need to know they've occurred. These techniques identify problems quickly and prevent cascade failures. The core principle: detect failures fast and isolate their impact.

Failure Detection Mechanisms

  • Heartbeat messages let nodes periodically announce they're alive. Missing heartbeats trigger failure suspicion after a timeout. For example, if Node A expects a heartbeat from Node B every 2 seconds and hasn't received one in 6 seconds, it suspects Node B has failed (the sketch after this list implements exactly this rule).
  • Watchdog timers reset when processes complete expected actions. If the timer expires without being reset, the process is assumed hung or crashed. These are common in embedded systems and operating system kernels.
  • Must balance detection speed against false positives. Aggressive timeouts catch failures quickly but may incorrectly declare healthy-but-slow nodes as failed (for instance, a node experiencing temporary network congestion). This can trigger unnecessary failovers that actually make things worse.
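A minimal sketch of a heartbeat-based detector matching the 2-second/6-second example above (class and method names are invented for this sketch):

```python
import time

class HeartbeatDetector:
    """Suspects a node after `timeout` seconds of silence."""

    def __init__(self, timeout=6.0):       # roughly three missed 2 s heartbeats
        self.timeout = timeout
        self.last_seen = {}

    def record_heartbeat(self, node):
        self.last_seen[node] = time.monotonic()

    def suspects(self, node):
        last = self.last_seen.get(node)
        # Never heard from, or silent past the deadline: suspect failure.
        return last is None or time.monotonic() - last > self.timeout
```

Shrinking `timeout` detects real crashes sooner but turns every burst of network congestion into a false suspicion, which is the speed-vs-false-positive balance from the last bullet.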

Load Balancing

  • Distributes work across nodes to prevent any single node from becoming overwhelmed and failing under load.
  • Static load balancing assigns work based on predefined rules (like round-robin). Dynamic load balancing adjusts in real time based on current node health and capacity, which is more adaptive but adds monitoring overhead. Both styles are sketched after this list.
  • Prevents cascade failures. When one node fails, the remaining nodes must absorb its work. Good load balancing ensures this extra work is spread evenly so no single node gets crushed and fails in turn, creating a chain reaction.
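A minimal sketch contrasting the two styles (node names and load figures are made up):

```python
import itertools

class RoundRobinBalancer:
    """Static: rotates through nodes regardless of their current load."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def pick(self):
        return next(self._cycle)

class LeastLoadedBalancer:
    """Dynamic: needs a monitor keeping the per-node load map fresh."""

    def __init__(self, loads):
        self.loads = loads                 # node -> current load fraction

    def pick(self):
        return min(self.loads, key=self.loads.get)

static = RoundRobinBalancer(["node-a", "node-b", "node-c"])
dynamic = LeastLoadedBalancer({"node-a": 0.2, "node-b": 0.9, "node-c": 0.4})
assert dynamic.pick() == "node-a"          # routes around the overloaded node-b
```

The dynamic balancer steers new work away from the hot node, which is exactly what prevents the cascade described above when a failed node's traffic must be absorbed by the survivors.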

Compare: Failure Detection vs. Load Balancing. Failure detection identifies when nodes fail so recovery can begin, while load balancing prevents failures by ensuring nodes aren't overwhelmed. Detection is reactive; load balancing is proactive. Both are essential for robust systems.


Quick Reference Table

Concept | Best Examples
Data/Process Redundancy | Replication, Redundancy (HW/SW), Process Pairs
State Recovery | Checkpointing, Error Correction Codes
Crash-Fault Consensus | Paxos, Raft
Byzantine/Malicious Faults | Byzantine Fault Tolerance
Software Fault Tolerance | N-Version Programming
Failure Detection | Heartbeats, Timeouts, Watchdog Timers
Failure Prevention | Load Balancing
Immediate Failover | Process Pairs, Replication

Self-Check Questions

  1. Which two techniques both use redundancy but protect against different types of failures? Explain what failure type each addresses and why one can't substitute for the other.

  2. A distributed database must remain available even if up to 2 nodes act maliciously and send false data. What technique is required, and what's the minimum number of nodes needed?

  3. Compare coordinated and uncoordinated checkpointing. When would you choose each approach, and what tradeoff are you making?

  4. An exam question describes a system where three independent teams implemented the same sorting algorithm, and the system runs all three versions and compares outputs. What technique is this, and what type of fault does it protect against that simple replication cannot?

  5. Why might aggressive failure detection (short timeouts) cause problems in a distributed system? Describe a scenario where this leads to incorrect behavior.