Designing a system for 1 million requests per second requires a layered, distributed architecture leveraging horizontal scaling, caching, asynchronous processing, database sharding, and global edge distribution, with each layer optimized for its specific bottleneck.
A system handling 1 million requests per second (1M RPS) represents an extreme scale that demands careful architectural choices at every layer. No single server, database, or caching tier can sustain this load. The architecture must be horizontally distributed across thousands of servers, with intelligent partitioning of both compute and data. The design must assume component failures are normal and build redundancy, graceful degradation, and automatic recovery into every layer.
At this scale, several principles become non-negotiable. First, the system must be stateless at the application tier to allow unlimited horizontal scaling. Second, data must be partitioned (sharded) across many databases to distribute write load. Third, caching must be pervasive, sitting at multiple layers (CDN, edge, application, database). Fourth, asynchronous processing decouples request handling from background work. Fifth, every operation must be designed for failure, with retries, circuit breakers, and fallbacks built in.
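The "design for failure" principle above, with retries, circuit breakers, and fallbacks, can be sketched as a minimal circuit breaker. The `failure_threshold` and `reset_timeout` parameters are illustrative, not from the original text:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, serves a
    fallback while open, and allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, short-circuit to the fallback until the cooldown expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

The key property is that a failing dependency stops receiving traffic entirely while the circuit is open, which prevents retry storms from cascading the failure upstream.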
Edge Layer (CDN + Global Load Balancing): Use a global CDN (CloudFront, Fastly) to cache static assets and even API responses for non-personalized content. DNS-level global load balancing (Route 53 geolocation routing, Anycast) routes users to the nearest regional deployment, reducing latency. A well-tuned edge layer can absorb 50-80% of requests before they reach the origin.
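As a back-of-envelope check on what that absorption means for origin capacity (the hit rates here are the illustrative 50-80% range from above):

```python
def origin_rps(total_rps, edge_hit_rate):
    """Requests per second that still reach origin after the edge
    layer absorbs its share of traffic."""
    return total_rps * (1.0 - edge_hit_rate)

# At 1M RPS, an 80% edge hit rate leaves the origin handling 200k RPS;
# at only 50%, the origin must still sustain 500k RPS.
```

The difference between those two numbers is why edge cacheability is often the single highest-leverage decision at this scale.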
API Gateway (Regional): Deploy API gateways in each region (Kong, Envoy, or cloud-native solutions). Gateways handle SSL termination, request routing, rate limiting (per user/per IP), authentication, and request validation. They also aggregate logs and metrics.
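The per-user/per-IP rate limiting a gateway applies is commonly a token bucket; a minimal sketch (the `rate` and `capacity` values are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket limiter of the kind a gateway keeps per user or IP:
    `rate` tokens are added per second, up to a burst `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In production the bucket state would live in a shared store (e.g. Redis) so that all gateway instances enforce one limit per key.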
Application Layer (Stateless Services): Services must be stateless, with no session data stored locally. Scale horizontally with auto-scaling groups, sizing each service for its expected share of traffic. For 1M RPS, you might need thousands of pods across multiple availability zones. Use Kubernetes for orchestration with the Horizontal Pod Autoscaler (HPA) configured to scale on CPU/memory or custom metrics.
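A rough sizing calculation behind "thousands of pods" (the per-pod throughput and utilization headroom here are assumptions for illustration, not from the original):

```python
import math

def pods_needed(total_rps, rps_per_pod, headroom=0.6):
    """Pods required if each pod is run at `headroom` of its maximum
    throughput, so spikes and pod failures don't saturate the fleet."""
    return math.ceil(total_rps / (rps_per_pod * headroom))

# 1M RPS at ~500 RPS per pod, run at 60% utilization -> ~3,334 pods.
```

Running below full utilization is deliberate: the headroom is what absorbs traffic spikes and the loss of an availability zone without queueing.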
Cache Layer: Use multiple caching tiers: edge cache for static content, CDN for API responses with short TTLs, and an application-level cache (Redis clusters) for database query results, session data, and computed values. Redis clusters should be partitioned (sharded) with read replicas. Cache hit rates must exceed 90% to protect the databases.
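The application-level cache typically follows the cache-aside pattern. A minimal sketch, using a plain dict in place of a Redis client (TTL and eviction handling omitted):

```python
class CacheAside:
    """Cache-aside: read from cache, fall back to the database on a miss,
    then populate the cache so subsequent reads skip the database."""

    def __init__(self, db_lookup):
        self.cache = {}            # stand-in for a Redis cluster
        self.db_lookup = db_lookup
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.db_lookup(key)  # the expensive database read
        self.cache[key] = value
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking the hit rate, as this sketch does, is how you verify the ">90%" target that keeps the databases protected.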
Database Layer: A single database cannot handle 1M writes/sec. Implement sharding (horizontal partitioning) based on a high-cardinality shard key (e.g., user_id). Use a distributed database like CockroachDB, Spanner, or Vitess (for MySQL) that handles sharding transparently. Each shard should have its own read replicas for read scaling. For analytical queries, use a separate data warehouse.
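Systems like Vitess or CockroachDB handle shard routing transparently, but the underlying idea is hashing the shard key to a shard number. A sketch (the shard count of 64 is illustrative):

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def shard_for(user_id, num_shards=NUM_SHARDS):
    """Route a user to a shard via a stable hash of the shard key.
    A stable hash (not Python's builtin hash(), which is salted per
    process) keeps routing consistent across processes and restarts."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Because the hash spreads a high-cardinality key uniformly, write load is distributed evenly; sharding on a low-cardinality key (e.g. country) would concentrate traffic on a few hot shards.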
Message Queues and Async Processing: Offload non-critical operations to message queues (Kafka, SQS). For example, after an order is placed, queue email notifications, inventory updates, and analytics. This reduces request latency and smooths traffic spikes.
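The order example above can be sketched as follows; `queue.Queue` stands in for Kafka or SQS, and the task names are hypothetical:

```python
import queue

task_queue = queue.Queue()  # stand-in for a Kafka topic or SQS queue

def place_order(order_id):
    """Handle only the critical path synchronously; enqueue the rest."""
    # ... write the order to the database (critical path) ...
    for task in ("send_email", "update_inventory", "record_analytics"):
        task_queue.put((task, order_id))
    return {"order_id": order_id, "status": "accepted"}

def drain_worker():
    """Background consumer: processes queued tasks off the request path."""
    processed = []
    while not task_queue.empty():
        processed.append(task_queue.get())
    return processed
```

The request returns as soon as the order is durably recorded; the queue both removes the follow-up work from request latency and buffers it through traffic spikes.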
Observability: At 1M RPS, manual debugging is impossible. Implement distributed tracing (Jaeger, Zipkin) with ~1% sampling to keep overhead manageable. Collect metrics at every layer (Prometheus, Datadog) and centralize structured logs (ELK stack, Loki). Real-time dashboards and automated alerting must detect anomalies before they cascade.
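The 1% trace sampling is usually done deterministically from the trace ID, so every service in a request's path makes the same keep/drop decision. A minimal sketch of that idea:

```python
import hashlib

def should_sample(trace_id, rate=0.01):
    """Deterministic head-based sampling: hash the trace ID into a
    bucket, keep the trace if the bucket falls under the sample rate.
    All services hashing the same ID reach the same decision."""
    bucket = int(hashlib.md5(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```

Hashing rather than rolling a random number per span is what guarantees sampled traces are complete end to end instead of having spans missing from some services.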
Global Distribution: Use geographic DNS and regional deployments to serve users from nearby locations. This reduces latency and distributes load. Consider active-active multi-region with conflict resolution for writes.
Database Write Scaling: Use sharding to distribute writes. Choose a shard key that provides uniform distribution (avoid hotspots). Consider event sourcing where writes are appended to immutable logs, then aggregated asynchronously.
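The event-sourcing idea above, in miniature: writes append immutable events, and read state is derived by folding over the log (in production, asynchronously by a separate consumer). The account/amount schema is hypothetical:

```python
class EventLog:
    """Event sourcing sketch: the log is append-only and never updated
    in place; current state is a fold over the recorded events."""

    def __init__(self):
        self.events = []  # append-only; each write is a cheap append

    def append(self, event):
        self.events.append(event)

    def balance(self, account):
        # Derive a read model for one account by replaying the log.
        total = 0
        for e in self.events:
            if e["account"] == account:
                total += e["amount"]
        return total
```

Appends avoid write contention on shared rows, which is why this pattern scales write throughput better than in-place updates.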
Cold Start for Auto-Scaling: At 1M RPS, scaling must be preemptive. Use predictive scaling based on historical patterns rather than reactive scaling. Keep a minimum buffer of instances to absorb spikes.
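A simple form of the buffer idea: size the fleet for the recent peak plus headroom, instead of reacting after load arrives. The 25% buffer is an illustrative value:

```python
import math

def target_capacity(recent_peak_rps_samples, buffer=0.25):
    """Predictive sizing sketch: provision for the observed peak plus a
    safety buffer, so spikes land on already-warm instances."""
    return math.ceil(max(recent_peak_rps_samples) * (1 + buffer))
```

A real predictive scaler would forecast from historical patterns (e.g. the same hour last week); the constant here is just the minimum spike-absorbing buffer the text describes.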
Network Bandwidth: At scale, network throughput becomes a constraint. Use 10/25/100 Gbps networking. Co-locate services that communicate heavily (use pod affinity in Kubernetes). Compress payloads (gzip, brotli).
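Payload compression is a one-liner in most stacks; a sketch using gzip from the standard library (brotli typically compresses further but needs a third-party package):

```python
import gzip
import json

def compress_payload(obj):
    """Serialize a payload to JSON and gzip it, returning both forms so
    the bandwidth saving can be measured."""
    raw = json.dumps(obj).encode("utf-8")
    return raw, gzip.compress(raw)

# Repetitive API payloads (repeated keys, similar records) compress well,
# directly reducing the network throughput each response consumes.
```

The saving compounds at 1M RPS: every byte shaved off a response is a megabyte per second of network capacity freed.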
Deployment and Rollback: Deployments must be zero-downtime with canary or blue-green patterns. Automated rollback on error thresholds. Feature flags to gradually roll out changes. Deploy at low-traffic times and monitor closely.
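The "automated rollback on error thresholds" check can be as simple as comparing the canary's error rate against the baseline; the 2x tolerance here is an assumed threshold:

```python
def should_rollback(canary_errors, canary_total, baseline_error_rate, tolerance=2.0):
    """Abort the canary if its error rate exceeds the baseline fleet's
    error rate by more than `tolerance` times."""
    if canary_total == 0:
        return False  # no traffic yet, nothing to judge
    return (canary_errors / canary_total) > baseline_error_rate * tolerance
```

A deployment controller would evaluate this on a sliding window and shift traffic back to the previous version automatically when it trips.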
Cost Optimization: At this scale, efficiency matters. Right-size instances, use spot instances for batch workloads, optimize cache hit rates, compress data, and regularly review resource utilization to eliminate waste.