Resilience is the ability of a software system to withstand and recover from failures, maintaining acceptable service levels even when components fail, networks partition, or load spikes occur.
Resilience is a fundamental quality attribute of robust software systems that goes beyond simple fault tolerance. While fault tolerance focuses on masking individual faults so the system keeps operating, resilience encompasses the system's ability to detect, respond to, and recover from failures while continuing to provide service. A resilient system doesn't just survive failures: it degrades gracefully, isolates problems, and automatically restores normal operation. This quality is essential in distributed systems, where failures are not exceptional events but expected occurrences.
Fault Isolation: Failures are contained within a component and don't cascade to other parts of the system. Circuit breakers prevent calls to failing services; bulkheads isolate resource pools.
Graceful Degradation: The system continues to function at reduced capacity rather than failing completely. Core features remain available while non-essential features may be temporarily disabled.
Self-Healing: The system automatically detects and recovers from failures without manual intervention. Health checks, auto-restart, and failover mechanisms restore service.
Elasticity: The system automatically scales resources up and down based on demand, absorbing load spikes without degradation.
Observability: Comprehensive monitoring, logging, and tracing provide visibility into system health, enabling detection of issues before they impact users.
Redundancy: Critical components have backups—multiple instances, data replicas, and fallback paths ensure single points of failure are eliminated.
Circuit Breaker: Prevents repeated calls to a failing service, allowing it time to recover. Three states: closed (normal operation), open (failing fast), half-open (testing recovery).
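The three-state machine described above can be sketched as a small wrapper; a minimal illustration, with the class name, thresholds, and timeout values chosen here for the example rather than taken from any particular library:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"  # allow one trial request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"
            return result
```

While open, callers fail immediately instead of tying up threads on a dependency that is already struggling; the half-open trial lets the circuit close again once the dependency recovers.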
Retry with Backoff: Automatically retries transient failures (network timeouts, temporary unavailability) with increasing delays between attempts.
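A sketch of this pattern, doubling the delay on each attempt and adding jitter; the function name, retried exception types, and delay bounds are illustrative choices for this example:

```python
import random
import time

def retry_with_backoff(func, max_attempts=4, base_delay=0.5, max_delay=8.0):
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Randomized jitter prevents many clients from retrying in lockstep.
            time.sleep(delay * random.uniform(0.5, 1.0))
```

Only transient errors are retried here; retrying non-transient failures (bad requests, auth errors) just adds load without any chance of success.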
Bulkhead: Isolates resources for different components, preventing failure in one area from consuming all resources and starving others. Like compartments in a ship's hull.
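One common way to build a bulkhead is to cap concurrency per dependency with a semaphore; a minimal sketch, with the class name and limit chosen for illustration:

```python
import threading

class Bulkhead:
    """Caps concurrent calls into one dependency so it cannot starve others."""

    def __init__(self, max_concurrent=5):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately when the compartment is full rather than queueing,
        # so a slow dependency cannot absorb every worker thread.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means a hang in one service exhausts only that service's slots, not the shared thread pool.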
Timeout: Sets maximum wait times for operations, preventing indefinite hangs that could tie up resources.
Fallback: Provides alternative responses when a service fails—cached data, default values, or degraded functionality.
Health Check: Regular probes that verify component health, enabling automated recovery actions like restarting unhealthy instances.
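The probe-then-recover loop can be sketched as follows; the class name, threshold, and the `probe`/`restart` callables are assumptions for this example rather than any specific orchestrator's API:

```python
class HealthChecker:
    """Triggers a recovery action after consecutive failed health probes."""

    def __init__(self, probe, restart, failure_threshold=3):
        self.probe = probe                      # returns True when the instance is healthy
        self.restart = restart                  # recovery action, e.g. restart the process
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self):
        """Run one probe; restart after too many consecutive failures."""
        if self.probe():
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.restart()
            self.consecutive_failures = 0
        return False
```

Requiring several consecutive failures before acting avoids restarting an instance over a single slow or dropped probe.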
Rate Limiting: Protects the system from overload by limiting incoming request rates, ensuring fair resource allocation.
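A common implementation of rate limiting is the token bucket, which allows a bounded burst while enforcing a steady average rate; a minimal sketch with illustrative parameter values:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: steady refill rate with bounded bursts."""

    def __init__(self, rate, capacity):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should reject, e.g. with HTTP 429
```

Keeping one bucket per client (or API key) rather than a single global bucket gives the fair allocation mentioned above: one noisy client exhausts only its own tokens.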
Netflix is a canonical example of a resilient system at massive scale. During AWS outages, Netflix has remained available by designing for failure at every level. Their architecture includes Chaos Monkey, which deliberately kills production instances to test resilience. Circuit breakers prevent cascading failures across microservices. Client-side load balancing ensures requests find healthy instances. Regional fallbacks allow serving from alternate regions during major outages. This resilience allows Netflix to maintain service even when underlying infrastructure fails—exactly what users expect from a premium service.
Mean Time Between Failures (MTBF): Average time between failures. Higher is better.
Mean Time to Recover (MTTR): Average time to restore service after failure. Lower is better.
Error Rate: Percentage of requests that fail. Should remain within acceptable thresholds.
Availability (SLA): Percentage of time the system is operational. 99.9% ("three nines") allows 8.76 hours of downtime per year; 99.99% allows 52.6 minutes.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. Lower RPO means stricter durability requirements.
Recovery Time Objective (RTO): Maximum acceptable downtime. Lower RTO means faster recovery requirements.
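The availability figures above follow from simple arithmetic on the yearly time budget, sketched here (the helper name is invented for this example):

```python
def downtime_budget(availability, minutes_per_year=365 * 24 * 60):
    """Annual downtime allowed by an availability target, in minutes."""
    return (1 - availability) * minutes_per_year

# 99.9%  ("three nines") -> ~525.6 minutes, i.e. ~8.76 hours per year
# 99.99% ("four nines")  -> ~52.6 minutes per year
```

Each additional nine cuts the budget by a factor of ten, which is why pushing from three to four nines typically demands redundancy and automated failover rather than faster manual response.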