
💻 Parallel and Distributed Computing

Fault Tolerance Techniques


Why This Matters

In parallel and distributed computing, failure isn't a possibility—it's an inevitability. When you're coordinating hundreds or thousands of nodes, hardware fails, networks partition, and processes crash. Exams test your understanding of how systems maintain correctness and availability despite these failures. You're being tested on the fundamental tradeoffs between consistency, availability, performance, and resource overhead that every distributed system must navigate.

These techniques represent different strategies for the same core problem: keeping systems running when things go wrong. Some approaches prevent failures from causing problems (redundancy), others detect and recover from failures (checkpointing), and still others ensure agreement even when nodes misbehave (Byzantine fault tolerance). Don't just memorize what each technique does—understand which failure scenarios each technique addresses and why you'd choose one over another.


Redundancy-Based Approaches

These techniques prevent failures from causing system-wide problems by maintaining multiple copies of data, processes, or components. The core principle: if one copy fails, others remain available to serve requests.

Replication

  • Creates multiple copies of data or processes across different nodes, ensuring the system survives individual node failures
  • Synchronous replication guarantees all copies update together (strong consistency), while asynchronous replication allows delayed updates (better performance, weaker consistency)
  • Enables automatic failover—when one replica fails, the system redirects to a healthy copy without user-visible interruption
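
To make the synchronous/asynchronous distinction concrete, here is a minimal Python sketch. The Primary and Replica classes and the latency value are invented for illustration, not a real replication library: synchronous writes wait for every copy before acknowledging, while asynchronous writes acknowledge first and let replicas catch up.

```python
import threading
import time

class Replica:
    """A follower node holding a copy of the data."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, key, value):
        time.sleep(0.01)              # simulate network + disk latency
        self.store[key] = value

class Primary:
    """A primary node that replicates every write to its followers."""
    def __init__(self, replicas):
        self.store = {}
        self.replicas = replicas

    def write_sync(self, key, value):
        # Synchronous replication: don't acknowledge until every
        # replica has applied the write (strong consistency, higher latency).
        self.store[key] = value
        for r in self.replicas:
            r.apply(key, value)
        return "acknowledged after all replicas updated"

    def write_async(self, key, value):
        # Asynchronous replication: acknowledge immediately and let
        # replicas catch up in the background (lower latency, but a
        # reader at a lagging replica may briefly see stale data).
        self.store[key] = value
        for r in self.replicas:
            threading.Thread(target=r.apply, args=(key, value)).start()
        return "acknowledged before replicas updated"

primary = Primary([Replica("r1"), Replica("r2")])
print(primary.write_sync("balance", 100))
print(primary.write_async("balance", 150))
```

The tradeoff shows up directly in the two methods: write_sync pays the full replica latency on every write, while write_async hides it but opens a window where copies disagree.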

Redundancy (Hardware and Software)

  • Duplicates critical components at multiple levels—from disk drives (RAID) to entire data centers—providing alternatives during failures
  • Hardware redundancy includes redundant power supplies, network paths, and storage; software redundancy includes backup services and standby processes
  • Increases both reliability and availability but comes with cost and complexity tradeoffs that system designers must balance

Process Pairs

  • Runs two identical processes on separate nodes that continuously monitor each other's execution
  • Provides immediate failover—if the primary process fails, the backup takes over with minimal interruption
  • Enables error detection through comparison—discrepancies between outputs can reveal corruption or bugs before they cause larger problems

Compare: Replication vs. Process Pairs—both maintain redundant copies, but replication focuses on data availability while process pairs focus on computation continuity. Process pairs also enable error detection through output comparison, which simple replication doesn't provide.
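
Here is a simplified, single-machine sketch of the process-pair idea. The run_process_pair helper is hypothetical, and real process pairs run on separate nodes that watch each other with heartbeats; this version just shows the failover and output-comparison behavior.

```python
def run_process_pair(task, inputs):
    """Run the same task as a 'primary' and a 'backup' computation,
    fail over if the primary crashes, and compare outputs."""
    try:
        primary_result = task(inputs)
    except Exception:
        # Primary crashed: the backup takes over immediately.
        return task(inputs), "failover to backup"

    backup_result = task(inputs)
    if primary_result != backup_result:
        # A mismatch between the two runs signals possible corruption or a bug.
        raise RuntimeError("output mismatch between primary and backup")
    return primary_result, "primary ok, outputs match"

result, status = run_process_pair(sorted, [3, 1, 2])
print(result, status)   # [1, 2, 3] primary ok, outputs match
```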


State Recovery Techniques

When failures do occur, these techniques enable systems to recover to a known good state rather than starting from scratch. The core principle: periodically save progress so you don't lose everything when something crashes.

Checkpointing and Rollback Recovery

  • Periodically saves process state to stable storage, creating restore points the system can return to after failures
  • Coordinated checkpointing synchronizes all processes (simpler recovery but higher overhead); uncoordinated checkpointing lets processes checkpoint independently (lower overhead but complex recovery)
  • Reduces work lost during failures from "everything since startup" to "everything since last checkpoint"—a critical tradeoff between checkpoint frequency and runtime overhead
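
A minimal sketch of checkpointing and rollback recovery for a long-running job, assuming a single process saving to local disk; the file name and checkpoint interval are arbitrary. If the job crashes, rerunning it resumes from the last checkpoint instead of from zero.

```python
import os
import pickle

CHECKPOINT_FILE = "sum_job.ckpt"   # hypothetical checkpoint file

def load_checkpoint():
    """Return the last saved (index, partial_sum), or a fresh start."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return 0, 0

def save_checkpoint(index, partial_sum):
    """Persist progress to stable storage so a restart can roll back here."""
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump((index, partial_sum), f)

def long_running_sum(data, checkpoint_every=1000):
    # Rollback recovery: resume from the last checkpoint if one exists.
    index, partial_sum = load_checkpoint()
    for i in range(index, len(data)):
        partial_sum += data[i]
        if (i + 1) % checkpoint_every == 0:
            # More frequent checkpoints mean less lost work but more runtime overhead.
            save_checkpoint(i + 1, partial_sum)
    if os.path.exists(CHECKPOINT_FILE):
        os.remove(CHECKPOINT_FILE)   # job finished, clear the restore point
    return partial_sum

print(long_running_sum(list(range(10_000))))
```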

Error Correction Codes

  • Adds redundant bits to data that allow detection and correction of transmission or storage errors without retransmission
  • Hamming codes correct single-bit errors; Reed-Solomon codes handle burst errors in storage systems and communications
  • Enables transparent recovery from common hardware-level errors—your data arrives correct even when the underlying medium introduces noise

Compare: Checkpointing vs. Error Correction Codes—checkpointing recovers from process-level failures by restoring saved state, while error correction codes handle data-level corruption by reconstructing original bits. Both reduce work lost to failures but operate at different abstraction levels.
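
To see error correction in action, here is a small Python implementation of the classic Hamming(7,4) code: 4 data bits get 3 parity bits, and the decoder can locate and flip any single corrupted bit. Real storage and network stacks use stronger codes such as Reed-Solomon, but the principle is the same.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword.
    Codeword layout (1-indexed positions): p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3      # 0 means no error detected
    if error_pos:
        c[error_pos - 1] ^= 1             # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                           # simulate a single-bit error in transit
print(hamming74_decode(codeword))          # [1, 0, 1, 1], original data recovered
```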


Consensus and Agreement Protocols

In distributed systems, nodes must agree on shared state even when some nodes fail or messages get lost. The core principle: achieve agreement among functioning nodes despite partial failures, ensuring all nodes see a consistent view.

Consensus Algorithms (Paxos, Raft)

  • Enable distributed nodes to agree on a single value even when some nodes crash or messages are delayed—essential for leader election, configuration changes, and replicated state machines
  • Raft prioritizes understandability with clear leader election and log replication phases; Paxos is more general but notoriously difficult to implement correctly
  • Guarantee safety (nodes never disagree) and liveness (the system eventually makes progress) under specific failure assumptions
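
Full Paxos or Raft is far too long to show here, but the majority-quorum rule at the heart of both fits in a few lines. The sketch below is a toy, leader-side view of log replication loosely in the style of Raft, with no terms, elections, or log repair: an entry commits once a strict majority of the cluster has a copy, so it survives minority crashes, and any two quorums always overlap in at least one node.

```python
def replicate_entry(entry, followers):
    """Toy leader-side log replication: send the entry to every follower,
    count acknowledgements (crashed followers never reply), and commit
    once a strict majority of the whole cluster, leader included, holds it."""
    cluster_size = len(followers) + 1          # +1 for the leader itself
    acks = 1                                   # the leader already has its own copy
    for follower in followers:
        if follower["alive"]:
            follower["log"].append(entry)
            acks += 1
    committed = acks > cluster_size // 2       # strict majority quorum
    return committed

followers = [{"alive": True, "log": []},
             {"alive": True, "log": []},
             {"alive": False, "log": []},      # crashed node
             {"alive": False, "log": []}]      # crashed node
print(replicate_entry("x = 7", followers))     # True: 3 of 5 copies is a majority
```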

Byzantine Fault Tolerance

  • Handles arbitrary failures including nodes that lie, send conflicting messages, or act maliciously—not just nodes that crash silently
  • Requires at least 3f + 1 nodes to tolerate f Byzantine faults, since the system needs enough honest nodes to outvote malicious ones
  • Essential for trustless environments like blockchain systems where you can't assume all participants are honest—but comes with significant performance overhead

Compare: Consensus Algorithms vs. Byzantine Fault Tolerance—standard consensus (Paxos, Raft) assumes nodes either work correctly or crash (crash-fault tolerance), while BFT handles nodes that actively misbehave. BFT requires more nodes and more message rounds, making it slower but necessary when you can't trust all participants.
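
The 3f + 1 requirement follows from a counting argument you can sketch directly. The toy example below is not a real protocol like PBFT, which needs multiple voting rounds; it only shows why a client can accept a reply once f + 1 identical copies arrive: at most f replicas can lie, so at least one of those matching replies must come from an honest node.

```python
from collections import Counter

def bft_cluster_size(f):
    """Minimum number of replicas needed to tolerate f Byzantine faults."""
    return 3 * f + 1

def bft_accept(replies, f):
    """Accept a value only if at least f + 1 identical replies arrive;
    with at most f liars, that many matching replies must include an
    honest node, so the value can be trusted."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None

f = 1                                   # tolerate one malicious replica
print(bft_cluster_size(f))              # 4 replicas required
replies = ["balance=100", "balance=100", "balance=100", "balance=999"]  # one liar
print(bft_accept(replies, f))           # balance=100
```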


Diversity and Voting Approaches

These techniques use independent implementations or judgments to catch errors that would slip past a single version. The core principle: if multiple independent sources agree, the result is probably correct.

N-Version Programming

  • Develops multiple independent implementations of the same specification, reducing the chance that all versions share the same bug
  • Compares outputs and uses majority voting to select the correct result—if two of three versions agree, that answer is used
  • Addresses software faults that replication can't handle—running the same buggy code on multiple nodes just replicates the bug

Compare: N-Version Programming vs. Replication—replication protects against hardware failures by running the same code on multiple machines, while N-version programming protects against software bugs by running different code that should produce the same results. Use replication for availability; use N-version for correctness.
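
Here is what the FRQ-style scenario looks like in code. The version_a/b/c functions stand in for three teams' independent implementations of the same sorting specification; all three run on the same input and a majority vote picks the answer, so a bug in any single version gets outvoted.

```python
from collections import Counter

def version_a(data):
    return sorted(data)                       # team A: library sort

def version_b(data):
    result = list(data)                       # team B: insertion sort
    for i in range(1, len(result)):
        j = i
        while j > 0 and result[j - 1] > result[j]:
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result

def version_c(data):
    result, remaining = [], list(data)        # team C: selection sort
    while remaining:
        smallest = min(remaining)
        remaining.remove(smallest)
        result.append(smallest)
    return result

def n_version_sort(data):
    """Run all three independently written versions and majority-vote
    on the output."""
    outputs = [tuple(v(data)) for v in (version_a, version_b, version_c)]
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: versions disagree")
    return list(winner)

print(n_version_sort([3, 1, 2]))              # [1, 2, 3]
```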


Detection and Prevention Mechanisms

Before you can recover from failures, you need to know they've occurred. These techniques identify problems quickly and prevent cascade failures. The core principle: detect failures fast and isolate their impact.

Failure Detection Mechanisms

  • Heartbeat messages let nodes periodically announce they're alive; missing heartbeats trigger failure suspicion after a timeout
  • Watchdog timers reset when processes complete expected actions—if the timer expires, the process is assumed hung or crashed
  • Must balance detection speed against false positives—aggressive timeouts catch failures quickly but may incorrectly declare healthy-but-slow nodes as failed
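
A minimal heartbeat-based failure detector, using a made-up HeartbeatMonitor class and an aggressively short timeout so the example runs quickly: nodes report in periodically, and any node whose last heartbeat is older than the timeout becomes suspected of failure.

```python
import time

class HeartbeatMonitor:
    """Track the last heartbeat from each node and suspect any node whose
    heartbeat is older than the timeout. A short timeout detects failures
    fast but risks declaring slow-but-healthy nodes dead (false positives)."""
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, node):
        self.last_seen[node] = time.monotonic()   # node announces it is alive

    def suspected_failures(self):
        now = time.monotonic()
        return [node for node, t in self.last_seen.items()
                if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=0.05)
monitor.heartbeat("node-1")
monitor.heartbeat("node-2")
time.sleep(0.1)
monitor.heartbeat("node-1")                       # node-1 keeps reporting in
print(monitor.suspected_failures())               # ['node-2']
```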

Load Balancing

  • Distributes work across nodes to prevent any single node from becoming overwhelmed and failing under load
  • Static load balancing assigns work based on predefined rules; dynamic load balancing adjusts in real-time based on current node health and capacity
  • Prevents cascade failures—when one node fails, good load balancing ensures remaining nodes can absorb the extra work without themselves becoming overloaded

Compare: Failure Detection vs. Load Balancing—failure detection identifies when nodes fail so recovery can begin, while load balancing prevents failures by ensuring nodes aren't overwhelmed. Detection is reactive; load balancing is proactive. Both are essential for robust systems.
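
To contrast the two load-balancing policies, here is a tiny sketch with invented node names and load numbers: static round-robin hands out work in a fixed order regardless of current load, while a dynamic least-loaded policy inspects current queue lengths before choosing.

```python
import itertools

class StaticRoundRobin:
    """Static policy: cycle through nodes in a fixed, predefined order."""
    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def pick(self, load):
        return next(self._cycle)              # ignores current load entirely

class DynamicLeastLoaded:
    """Dynamic policy: send new work to the node with the fewest outstanding tasks."""
    def pick(self, load):
        return min(load, key=load.get)

load = {"node-1": 12, "node-2": 3, "node-3": 7}   # current outstanding tasks per node
print(StaticRoundRobin(list(load)).pick(load))     # node-1 (fixed order)
print(DynamicLeastLoaded().pick(load))             # node-2 (least loaded right now)
```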


Quick Reference Table

Concept | Best Examples
Data/Process Redundancy | Replication, Redundancy (HW/SW), Process Pairs
State Recovery | Checkpointing, Error Correction Codes
Crash-Fault Consensus | Paxos, Raft
Byzantine/Malicious Faults | Byzantine Fault Tolerance
Software Fault Tolerance | N-Version Programming
Failure Detection | Heartbeats, Timeouts, Watchdog Timers
Failure Prevention | Load Balancing
Immediate Failover | Process Pairs, Replication

Self-Check Questions

  1. Which two techniques both use redundancy but protect against different types of failures? Explain what failure type each addresses and why one can't substitute for the other.

  2. A distributed database must remain available even if up to 2 nodes act maliciously and send false data. What technique is required, and what's the minimum number of nodes needed?

  3. Compare coordinated and uncoordinated checkpointing. When would you choose each approach, and what tradeoff are you making?

  4. An FRQ describes a system where three independent teams implemented the same sorting algorithm, and the system runs all three versions and compares outputs. What technique is this, and what type of fault does it protect against that simple replication cannot?

  5. Why might aggressive failure detection (short timeouts) cause problems in a distributed system? Describe a scenario where this leads to incorrect behavior.