In parallel and distributed computing, failure isn't a possibility—it's an inevitability. When you're coordinating hundreds or thousands of nodes, hardware fails, networks partition, and processes crash. The AP exam tests your understanding of how systems maintain correctness and availability despite these failures. You're being tested on the fundamental tradeoffs between consistency, availability, performance, and resource overhead that every distributed system must navigate.
These techniques represent different strategies for the same core problem: keeping systems running when things go wrong. Some approaches prevent failures from causing problems (redundancy), others detect and recover from failures (checkpointing), and still others ensure agreement even when nodes misbehave (Byzantine fault tolerance). Don't just memorize what each technique does—understand which failure scenarios each technique addresses and why you'd choose one over another.
Redundancy techniques prevent failures from causing system-wide problems by maintaining multiple copies of data, processes, or components. The core principle: if one copy fails, others remain available to serve requests.
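As a rough sketch of the idea, here is a toy in-memory key-value store that writes every update to several replicas and can still serve reads after one replica dies (the class name and replica count are illustrative, not any particular database):

```python
class ReplicatedStore:
    """Toy key-value store that writes every value to all replicas."""

    def __init__(self, num_replicas=3):
        # Each replica is just an independent dictionary here.
        self.replicas = [{} for _ in range(num_replicas)]
        self.alive = [True] * num_replicas

    def put(self, key, value):
        # Write to every replica that is still up.
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                replica[key] = value

    def get(self, key):
        # Any surviving replica can serve the read.
        for i, replica in enumerate(self.replicas):
            if self.alive[i] and key in replica:
                return replica[key]
        raise KeyError(f"{key} unavailable: all replicas holding it are down")

store = ReplicatedStore(num_replicas=3)
store.put("balance", 100)
store.alive[0] = False          # simulate one replica crashing
print(store.get("balance"))     # still served by a surviving replica -> 100
```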
Compare: Replication vs. Process Pairs—both maintain redundant copies, but replication focuses on data availability while process pairs focus on computation continuity. Process pairs also enable error detection through output comparison, which simple replication doesn't provide.
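A compressed illustration of the process-pair idea, assuming both copies run in one Python process for brevity (in a real system the primary and backup run on separate machines):

```python
def run_process_pair(task, arg):
    """Run the same task as a primary/backup pair.

    If the primary crashes, the backup's result is used (failover).
    If both succeed, comparing outputs gives cheap error detection.
    """
    primary_result = None
    primary_failed = False
    try:
        primary_result = task(arg)
    except Exception:
        primary_failed = True        # primary crashed mid-computation

    backup_result = task(arg)        # backup runs the same computation

    if primary_failed:
        return backup_result         # immediate failover to the backup
    if primary_result != backup_result:
        raise RuntimeError("output mismatch: possible transient fault")
    return primary_result

print(run_process_pair(lambda n: sum(range(n)), 10))   # 45
```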
When failures do occur, state recovery techniques let systems return to a known good state rather than starting from scratch. The core principle: periodically save progress so you don't lose everything when something crashes.
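A minimal checkpointing sketch, assuming a made-up progress.ckpt file and a simple summation job; the point is that a crash loses only the work since the last checkpoint, not everything:

```python
import json
import os

CHECKPOINT_FILE = "progress.ckpt"   # illustrative filename

def load_checkpoint():
    # Resume from the last saved state, or start fresh.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"next_item": 0, "partial_sum": 0}

def save_checkpoint(state):
    # Write to a temp file, then rename, so a crash mid-write
    # can't corrupt the previous good checkpoint.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

state = load_checkpoint()
for i in range(state["next_item"], 1_000):
    state["partial_sum"] += i
    state["next_item"] = i + 1
    if i % 100 == 0:                 # checkpoint every 100 items
        save_checkpoint(state)
print(state["partial_sum"])          # 499500, even after a crash and restart
```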
Compare: Checkpointing vs. Error Correction Codes—checkpointing recovers from process-level failures by restoring saved state, while error correction codes handle data-level corruption by reconstructing original bits. Both reduce work lost to failures but operate at different abstraction levels.
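To make the abstraction-level difference concrete, here is a toy error correction example using a triple repetition code; real systems use denser codes such as Hamming or Reed-Solomon, but the majority-vote reconstruction idea is the same:

```python
def encode(bits):
    # Triple repetition code: store each bit three times.
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(coded):
    # Majority vote over each group of three corrects any single flipped bit.
    decoded = []
    for i in range(0, len(coded), 3):
        group = coded[i:i + 3]
        decoded.append(1 if sum(group) >= 2 else 0)
    return decoded

original = [1, 0, 1, 1]
stored = encode(original)
stored[4] ^= 1                      # one bit gets corrupted at rest or in transit
assert decode(stored) == original   # data reconstructed without any restart
```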
In distributed systems, nodes must agree on shared state even when some nodes fail or messages get lost. The core principle: achieve agreement among functioning nodes despite partial failures, ensuring all nodes see a consistent view.
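This is not a full Paxos or Raft implementation, just a sketch of the quorum idea they rely on: a value counts as decided only once a strict majority of nodes accepts it, and any two majorities overlap in at least one node, so two conflicting values can never both be decided:

```python
def reach_agreement(node_votes, total_nodes):
    """Toy quorum check over the values nodes have accepted."""
    quorum = total_nodes // 2 + 1          # strict majority
    counts = {}
    for value in node_votes.values():
        counts[value] = counts.get(value, 0) + 1
    for value, count in counts.items():
        if count >= quorum:
            return value                    # decided
    return None                             # no decision yet; keep running the protocol

# 5-node cluster, 2 nodes unreachable, but 3 accepted "commit-txn-42"
votes = {"n1": "commit-txn-42", "n2": "commit-txn-42", "n3": "commit-txn-42"}
print(reach_agreement(votes, total_nodes=5))   # "commit-txn-42"
```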
Compare: Consensus Algorithms vs. Byzantine Fault Tolerance—standard consensus (Paxos, Raft) assumes nodes either work correctly or crash (crash-fault tolerance), while BFT handles nodes that actively misbehave. BFT requires more nodes and more message rounds, making it slower but necessary when you can't trust all participants.
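The node-count difference is easy to quantify: PBFT-style Byzantine fault tolerance needs at least 3f + 1 nodes to tolerate f malicious nodes, versus 2f + 1 nodes for crash-only faults:

```python
def min_nodes_crash_fault(f):
    # Crash-fault consensus (Paxos/Raft style): a majority of nodes
    # must survive, so 2f + 1 total tolerates f crashed nodes.
    return 2 * f + 1

def min_nodes_byzantine(f):
    # PBFT-style BFT: 3f + 1 nodes tolerate f nodes that may lie
    # or send conflicting messages to different peers.
    return 3 * f + 1

for f in range(1, 4):
    print(f, min_nodes_crash_fault(f), min_nodes_byzantine(f))
# f = 2 (two malicious nodes) -> 7 nodes for BFT, versus 5 for crash-only faults
```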
Software fault tolerance techniques run multiple independent implementations and compare or vote on their outputs to catch errors that would slip past any single version. The core principle: if multiple independent sources agree, the result is probably correct.
Compare: N-Version Programming vs. Replication—replication protects against hardware failures by running the same code on multiple machines, while N-version programming protects against software bugs by running different code that should produce the same results. Use replication for availability; use N-version for correctness.
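A toy N-version voting harness; the three sort implementations below stand in for versions written by independent teams, and a bug in any one of them is outvoted by the other two:

```python
def sort_builtin(xs):
    return sorted(xs)

def sort_insertion(xs):
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def sort_selection(xs):
    xs = list(xs)
    out = []
    while xs:
        smallest = min(xs)
        xs.remove(smallest)
        out.append(smallest)
    return out

def n_version_sort(xs):
    # Independently developed versions vote on the answer; a bug in
    # one version is outvoted by the other two.
    results = [sort_builtin(xs), sort_insertion(xs), sort_selection(xs)]
    for candidate in results:
        if results.count(candidate) >= 2:
            return candidate
    raise RuntimeError("no majority: versions disagree")

print(n_version_sort([3, 1, 2]))   # [1, 2, 3]
```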
Before you can recover from failures, you need to know they've occurred. Failure detection and prevention techniques identify problems quickly and keep them from cascading. The core principle: detect failures fast and isolate their impact.
Compare: Failure Detection vs. Load Balancing—failure detection identifies when nodes fail so recovery can begin, while load balancing prevents failures by ensuring nodes aren't overwhelmed. Detection is reactive; load balancing is proactive. Both are essential for robust systems.
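A minimal heartbeat-and-timeout failure detector sketch (class name and timeout values are illustrative); notice that the timeout is the knob that makes detection more or less aggressive, which is exactly the tradeoff the last practice question below asks about:

```python
import time

class HeartbeatMonitor:
    """Toy failure detector: a node is suspected failed if no heartbeat
    has arrived from it within the timeout window."""

    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def suspected_failures(self):
        now = time.monotonic()
        return [node for node, t in self.last_seen.items()
                if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=0.1)
monitor.record_heartbeat("worker-1")
monitor.record_heartbeat("worker-2")
time.sleep(0.2)
monitor.record_heartbeat("worker-2")       # worker-2 keeps checking in
print(monitor.suspected_failures())        # ['worker-1']
```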
| Concept | Best Examples |
|---|---|
| Data/Process Redundancy | Replication, Redundancy (HW/SW), Process Pairs |
| State Recovery | Checkpointing, Error Correction Codes |
| Crash-Fault Consensus | Paxos, Raft |
| Byzantine/Malicious Faults | Byzantine Fault Tolerance |
| Software Fault Tolerance | N-Version Programming |
| Failure Detection | Heartbeats, Timeouts, Watchdog Timers |
| Failure Prevention | Load Balancing |
| Immediate Failover | Process Pairs, Replication |
Which two techniques both use redundancy but protect against different types of failures? Explain what failure type each addresses and why one can't substitute for the other.
A distributed database must remain available even if up to 2 nodes act maliciously and send false data. What technique is required, and what's the minimum number of nodes needed?
Compare coordinated and uncoordinated checkpointing. When would you choose each approach, and what tradeoff are you making?
An FRQ describes a system where three independent teams implemented the same sorting algorithm, and the system runs all three versions and compares outputs. What technique is this, and what type of fault does it protect against that simple replication cannot?
Why might aggressive failure detection (short timeouts) cause problems in a distributed system? Describe a scenario where this leads to incorrect behavior.