☁️Cloud Computing Architecture

Disaster Recovery Strategies

Why This Matters

Disaster recovery isn't just about backing up your files—it's about understanding the trade-offs between cost, recovery speed, and data loss tolerance that drive every architectural decision in cloud computing. You're being tested on your ability to match business requirements to appropriate recovery strategies, which means knowing why a pilot light approach costs less than a hot site, or when synchronous replication is worth the performance overhead.

The core concepts here revolve around RTO and RPO metrics, resource scaling patterns, and data consistency models. Every strategy you learn exists on a spectrum from cheap-but-slow to expensive-but-instant. Don't just memorize the names—know what concept each strategy illustrates and when you'd recommend one over another in a real-world scenario.


Defining Your Recovery Requirements

Before selecting any strategy, you need to establish measurable targets. These metrics form the foundation of every disaster recovery architecture and directly influence which solutions are viable for a given business case.

Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

  • RTO defines maximum acceptable downtime—if your business can tolerate 4 hours offline, your RTO is 4 hours, and every architectural choice must meet this threshold
  • RPO specifies maximum data loss tolerance—an RPO of 1 hour means you can lose up to 1 hour of transactions; this directly determines your replication strategy
  • Lower RTO/RPO values sharply increase costs: each step toward zero needs more always-on infrastructure, and near-zero values require hot standby infrastructure and synchronous replication
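
To make the two metrics concrete, here is a minimal sketch of how a measured recovery could be checked against its targets. The class and function names (`RecoveryTargets`, `DrillResult`, `meets_targets`) are hypothetical, chosen for illustration, and the drill numbers are invented sample values.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    rto_minutes: float  # maximum acceptable downtime
    rpo_minutes: float  # maximum acceptable window of lost data


@dataclass
class DrillResult:
    downtime_minutes: float   # measured time until service was restored
    data_loss_minutes: float  # age of the newest recoverable data at failure


def meets_targets(targets, result):
    """Return (rto_ok, rpo_ok) for one measured recovery."""
    rto_ok = result.downtime_minutes <= targets.rto_minutes
    rpo_ok = result.data_loss_minutes <= targets.rpo_minutes
    return rto_ok, rpo_ok


# Targets from the bullets above: RTO of 4 hours, RPO of 1 hour.
targets = RecoveryTargets(rto_minutes=240, rpo_minutes=60)
# Hypothetical drill: service back in 3 hours, but 90 minutes of data lost.
print(meets_targets(targets, DrillResult(180, 90)))  # (True, False)
```

The two checks are independent: this drill meets the RTO and still fails the RPO, which is why each metric has to be addressed separately.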

Disaster Recovery Planning and Testing

  • Comprehensive documentation outlines recovery procedures—includes runbooks, responsibility matrices, and communication protocols for each failure scenario
  • Regular testing validates actual recovery capabilities: tabletop exercises, simulation drills, and full failover tests reveal gaps between planned and actual RTO/RPO
  • Stakeholder training ensures execution under pressure—untested plans fail when humans panic; automation reduces error rates during actual incidents

Compare: RTO vs. RPO—both are time-based metrics, but RTO measures downtime while RPO measures data loss. If an exam question describes a scenario, identify which metric the business prioritizes to determine the appropriate strategy.


Low-Cost Recovery Approaches

These strategies minimize infrastructure costs by accepting longer recovery times. They're ideal when budget constraints outweigh the need for instant failover, or when the business can tolerate hours of downtime.

Backup and Restore

  • Creates point-in-time copies of data and applications—the most basic DR strategy with the longest recovery time but lowest ongoing costs
  • Backup types affect storage and recovery speed: full backups capture everything, incremental backups capture changes since the last backup of any type, and differential backups capture changes since the last full backup
  • Regular restoration testing is non-negotiable—backups that can't be restored are worthless; schedule quarterly recovery drills to validate integrity
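
The effect of backup type on recovery speed shows up in how many backups a restore has to read. The sketch below works that out under stated assumptions: at most one backup per day, and the `Backup` record and `restore_chain` function are hypothetical names rather than any backup tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups, target_day):
    """Return the backups needed, in order, to restore state as of target_day."""
    usable = sorted((b for b in backups if b.day <= target_day),
                    key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available on or before target_day")
    start = full_positions[-1]           # latest full backup is the base
    chain = [usable[start]]
    later = usable[start + 1:]
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        # The newest differential already holds every change since the full.
        newest = differentials[-1]
        chain.append(newest)
        later = [b for b in later if b.day > newest.day]
    # Each remaining incremental holds only changes since the previous backup.
    chain.extend(b for b in later if b.kind == "incremental")
    return chain


incremental_plan = [Backup(0, "full")] + [Backup(d, "incremental") for d in (1, 2, 3)]
differential_plan = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]
print(len(restore_chain(incremental_plan, 3)))   # 4 backups to read
print(len(restore_chain(differential_plan, 3)))  # 2 backups to read
```

Incremental backups are smaller to store but lengthen the restore chain; differential backups grow each day but keep the chain at two reads.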

Pilot Light

  • Maintains minimal core infrastructure in standby mode—database servers and critical components stay running, but application tiers remain off until needed
  • Faster than backup/restore, cheaper than warm standby: you're pre-provisioning the slowest-to-launch components while keeping compute costs low
  • Scaling automation is essential—without pre-configured auto-scaling, manual intervention delays recovery beyond acceptable RTO thresholds

Compare: Backup and Restore vs. Pilot Light—both are cost-conscious approaches, but pilot light keeps critical infrastructure "warm" for faster recovery. Choose backup/restore when RTO tolerance exceeds 24 hours; choose pilot light for 1-4 hour RTO requirements.


Balanced Recovery Approaches

These strategies offer middle-ground solutions that balance cost against recovery speed. They're the most common choice for production workloads with moderate availability requirements.

Warm Standby

  • Runs a scaled-down but functional environment continuously—all components are operational at reduced capacity, ready to scale up during failover
  • Recovery measured in minutes, not hours: traffic redirection and auto-scaling are the primary delays, not infrastructure provisioning
  • Cost-effective for sub-hour RTO requirements—you're paying for baseline compute continuously but avoiding full duplicate infrastructure costs

Cloud-based Disaster Recovery

  • Leverages cloud elasticity for on-demand recovery resources—pay for standby infrastructure only when activated, reducing idle costs
  • Eliminates physical infrastructure management: no hardware procurement, no data center contracts, just API calls to provision recovery environments
  • Global cloud regions enable geographic distribution—deploy recovery environments in different regions without building physical facilities

Compare: Warm Standby vs. Cloud-based DR—warm standby is a capacity pattern while cloud-based DR is a deployment model. You can implement warm standby using cloud-based DR; they're not mutually exclusive. FRQ tip: describe how cloud elasticity reduces the cost of maintaining warm standby environments.


High-Availability Recovery Approaches

These strategies prioritize minimal downtime and near-zero data loss. They require significant infrastructure investment but are essential for mission-critical systems where even minutes of downtime cause substantial business impact.

Hot Site / Multi-Site

  • Fully operational duplicate environment running continuously—both sites handle production traffic, with either capable of absorbing full load during failure
  • Supports near-zero RTO through active-active configuration: no failover delay because both sites are already serving users
  • Highest cost due to complete infrastructure duplication—you're essentially paying for two production environments to achieve maximum resilience

Data Replication

  • Copies data between locations in real-time or near real-time—ensures standby environments have current data when failover occurs
  • Synchronous replication guarantees zero data loss for acknowledged writes: a write isn't acknowledged until it is confirmed at both locations, which adds latency to every transaction
  • Asynchronous replication trades data loss risk for performance: writes complete immediately while replication happens in the background, so the achievable RPO is bounded by the replication lag at the moment of failure
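
The difference between the two modes can be seen in a small in-memory simulation. This is a toy sketch, not a real replication protocol: `ReplicatedStore` and its methods are hypothetical, and it models only which acknowledged writes survive a primary failure, not the extra latency of synchronous writes.

```python
from collections import deque


class ReplicatedStore:
    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = deque()  # acknowledged writes not yet on the replica

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # Confirm at the replica before acknowledging the write.
            self.replica.append(record)
        else:
            # Acknowledge now; replicate in the background later.
            self.pending.append(record)
        return "ack"

    def replicate_pending(self, count):
        """Ship up to `count` queued writes to the replica."""
        for _ in range(min(count, len(self.pending))):
            self.replica.append(self.pending.popleft())

    def fail_primary(self):
        """Simulate losing the primary; return acknowledged writes that are lost."""
        lost = list(self.pending)
        self.pending.clear()
        self.primary = []
        return lost


sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False)
for store in (sync_store, async_store):
    for record in ("t1", "t2", "t3"):
        store.write(record)
async_store.replicate_pending(1)   # background replication is lagging
print(sync_store.fail_primary())   # []
print(async_store.fail_primary())  # ['t2', 't3']
```

The writes still sitting in the queue at failure time are the replication lag, and they are exactly what the asynchronous store loses.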

Failover and Failback

  • Failover redirects traffic to standby systems automatically—triggered by health checks detecting primary system failure; automation is critical for meeting aggressive RTO targets
  • Failback returns operations to primary after restoration—often more complex than failover due to data synchronization requirements
  • Testing both directions prevents surprises—many organizations test failover but neglect failback, leading to extended outages when returning to normal operations
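
As a rough illustration of health-check-driven failover and a guarded failback, consider the sketch below. The `FailoverController` class, the threshold of three consecutive failures, and the `data_synchronized` flag are all assumptions made for this example; real systems would rely on a load balancer's or DNS service's own health-check configuration.

```python
class FailoverController:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold  # assumed value, tune per system
        self.consecutive_failures = 0
        self.active = "primary"

    def record_health_check(self, primary_healthy):
        """Process one health-check result and return the site serving traffic."""
        if self.active != "primary":
            # Already failed over; failback is a separate, deliberate step.
            return self.active
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "standby"
        return self.active

    def fail_back(self, primary_healthy, data_synchronized):
        """Return to the primary only once it is healthy and has caught up."""
        if self.active == "standby" and primary_healthy and data_synchronized:
            self.active = "primary"
            self.consecutive_failures = 0
        return self.active


controller = FailoverController(failure_threshold=3)
for healthy in (True, False, False, False):
    site = controller.record_health_check(healthy)
print(site)                                                  # standby
print(controller.fail_back(True, data_synchronized=False))  # standby
print(controller.fail_back(True, data_synchronized=True))   # primary
```

Requiring several consecutive failures avoids failing over on a single dropped check, and the synchronization condition reflects why failback tends to be the harder direction: writes accepted by the standby must reach the primary before it resumes service.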

Compare: Synchronous vs. Asynchronous Replication—synchronous guarantees RPO of zero but adds write latency; asynchronous maintains performance but accepts potential data loss. If an FRQ describes a financial trading system, synchronous is likely required; for content delivery, asynchronous is usually acceptable.


Geographic Distribution Strategies

Distributing resources across physical locations protects against regional disasters and improves availability. These strategies complement the recovery approaches above by addressing where your infrastructure lives.

Geographical Redundancy

  • Distributes resources across multiple physical regions—protects against natural disasters, power grid failures, and regional network outages
  • Increases complexity of data consistency management: CAP theorem trade-offs become relevant when data must remain consistent across distant locations
  • Regulatory requirements may mandate specific geographic placement—data sovereignty laws require certain data to remain within national boundaries even for DR copies
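
A data sovereignty constraint can be treated as a filter on candidate recovery regions. The sketch below assumes a simple mapping from region name to country code; the region names are made up for illustration and do not correspond to any provider's real regions.

```python
def eligible_dr_regions(primary_region, regions, allowed_countries):
    """Return regions other than the primary that sit in an allowed country.

    regions: dict mapping region name -> country code (hypothetical data).
    """
    return sorted(
        name
        for name, country in regions.items()
        if name != primary_region and country in allowed_countries
    )


# Hypothetical region catalogue.
regions = {
    "region-de-1": "DE",
    "region-de-2": "DE",
    "region-us-1": "US",
}
# Data must stay in Germany, so only the second German region qualifies.
print(eligible_dr_regions("region-de-1", regions, {"DE"}))  # ['region-de-2']
```

If the filter returns an empty list, the regulation rules out geographic redundancy with that provider's footprint, and the DR design has to change rather than the compliance requirement.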

Compare: Multi-Site vs. Geographical Redundancy—multi-site is an active-active deployment pattern while geographical redundancy is a risk mitigation strategy. You can have geographical redundancy with a pilot light approach (passive standby in another region) without running multi-site active-active.


Quick Reference Table

Concept | Best Examples
Lowest Cost / Longest Recovery | Backup and Restore
Cost-Optimized with Faster Recovery | Pilot Light, Cloud-based DR
Balanced Cost and Speed | Warm Standby
Minimal Downtime / Highest Cost | Hot Site / Multi-Site
Zero Data Loss Requirement | Synchronous Data Replication
Performance-Optimized Replication | Asynchronous Data Replication
Regional Disaster Protection | Geographical Redundancy
Recovery Metrics | RTO (downtime), RPO (data loss)
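
The reference table can also be read as a lookup from targets to strategies. The sketch below is one possible encoding, not a definitive rule: the function name `recommend_strategy` is hypothetical, the one-minute cut-off for "near-zero" RTO is an assumption, and so is assigning the 4 to 24 hour range to pilot light (this guide only gives 1 to 4 hours for pilot light and more than 24 hours for backup and restore). For a nonzero RPO, backups taken more often than the RPO could also suffice; the sketch names asynchronous replication for simplicity.

```python
def recommend_strategy(rto_minutes, rpo_minutes):
    """Map recovery targets to (recovery strategy, replication mode)."""
    if rto_minutes < 1:              # assumed cut-off for "near-zero" RTO
        strategy = "Hot Site / Multi-Site"
    elif rto_minutes < 60:           # sub-hour RTO
        strategy = "Warm Standby"
    elif rto_minutes <= 24 * 60:     # 1-4 h per the guide; 4-24 h is an assumption
        strategy = "Pilot Light"
    else:                            # RTO tolerance beyond 24 hours
        strategy = "Backup and Restore"

    if rpo_minutes == 0:
        replication = "Synchronous Data Replication"
    else:
        replication = "Asynchronous Data Replication"
    return strategy, replication


print(recommend_strategy(rto_minutes=30, rpo_minutes=5))
# ('Warm Standby', 'Asynchronous Data Replication')
```

RTO drives the first element and RPO drives the second, which mirrors how the two metrics are handled separately throughout this guide.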

Self-Check Questions

  1. A company has an RTO of 15 minutes and an RPO of zero. Which two strategies must they implement together, and why?

  2. Compare pilot light and warm standby approaches: what infrastructure components are running in each, and how does this affect recovery time?

  3. An e-commerce platform processes $50,000 in transactions per hour. If their current backup strategy has a 4-hour RPO, what is their maximum potential data loss in dollar terms, and which replication strategy would reduce this?

  4. Explain why failback is often more complex than failover, and describe what testing should occur before returning to the primary system.

  5. A startup wants disaster recovery but has limited budget. They can tolerate 4 hours of downtime but cannot lose more than 30 minutes of data. Which combination of strategies would you recommend, and how do they address each metric separately?