🏆 Distributed Systems & Cloud Building Blocks Interview Questions Master List
Table of Contents
- Domain Name System (DNS)
- Load Balancer
- Database
- Key-Value Store
- Content Delivery Network (CDN)
- Sequencer
- Service Monitoring
- Distributed Caching
- Distributed Messaging Queue
- Publish-Subscribe System
- Rate Limiter
- Distributed Search
- Distributed Logging
- Distributed Tracing
- Distributed Task Scheduling
- Blob Storage
🌐 Domain Name System (DNS)
- Explain how DNS resolution works step by step.
- What is the difference between recursive and iterative DNS queries?
- What are common DNS record types (A, AAAA, CNAME, MX, TXT, SRV)?
- How do DNS caching and TTL (Time-to-Live) impact performance?
- How does Amazon Route 53 provide high availability and latency-based routing?
- What is DNS poisoning/spoofing? How do you prevent it?
- Explain how CDN providers integrate with DNS for load balancing.
- Design a fault-tolerant DNS resolution system for a global e-commerce app.
⚖️ Load Balancer
- What’s the difference between Layer 4 and Layer 7 load balancers?
- How does AWS ELB detect unhealthy instances and reroute traffic?
- Explain sticky sessions and when you’d use them.
- How would you design a load balancing strategy for a real-time chat system?
- What’s the role of reverse proxies like Nginx in load balancing?
- How do load balancers handle SSL/TLS termination?
- Explain how global load balancing differs from regional load balancing.
- How do you design a load balancer setup that supports disaster recovery?
🗄️ Database
- Compare SQL vs NoSQL databases and when to use each.
- How does database replication improve availability?
- What is the difference between master-slave (primary-replica) and master-master (multi-primary) replication?
- How do sharding and partitioning differ?
- Explain the CAP theorem and its application in database systems.
- How does DynamoDB achieve horizontal scalability?
- Design a schema for a high-volume transaction system (like UPI).
- How do you handle database migration in production without downtime?
🔑 Key-Value Store
- How do Redis and DynamoDB differ in design and use cases?
- What are typical eviction policies in Redis?
- How do you ensure consistency in a distributed key-value store?
- Explain write-through, write-back, and write-around caching strategies.
- Design a leaderboard system using Redis sorted sets.
- How do you prevent cache stampede and thundering herd problems?
- What’s the difference between strong and eventual consistency in key-value stores?
- How do you secure Redis or DynamoDB in production?
🚀 Content Delivery Network (CDN)
- How does a CDN reduce latency for end users?
- What’s the difference between a push CDN and a pull CDN?
- How do you handle cache invalidation in a CDN?
- Explain edge caching and its importance in video streaming.
- How does Cloudflare handle DDoS protection via CDN?
- Design a global CDN strategy for an OTT video platform.
- What are the challenges in serving dynamic vs static content via a CDN?
🔢 Sequencer
- Why do distributed systems need a sequencer?
- Explain Twitter’s Snowflake ID generator design.
- What are trade-offs between UUID and Snowflake IDs?
- How do you design a sequencer to avoid collisions in a high-scale system?
- How do sequencers preserve causality in event processing?
- What are scalability challenges in a centralized sequencer?
- How do you implement distributed ID generation in DynamoDB?
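As a starting point for the Snowflake questions above, here is a minimal sketch of a Snowflake-style generator. It assumes the commonly described 64-bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit per-millisecond sequence); the class name, the `clock_ms` parameter, and the epoch constant are illustrative choices, not a reference implementation.

```python
import threading
import time


class SnowflakeGenerator:
    """Illustrative Snowflake-style ID generator (hypothetical class name)."""

    EPOCH_MS = 1288834974657  # assumed custom epoch; any fixed past instant works
    WORKER_BITS = 10
    SEQ_BITS = 12
    MAX_SEQ = (1 << SEQ_BITS) - 1

    def __init__(self, worker_id, clock_ms=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id does not fit in 10 bits")
        self.worker_id = worker_id
        # clock_ms is injectable so the generator can be tested deterministically.
        self.clock_ms = clock_ms or (lambda: int(time.time() * 1000))
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock_ms()
            if now < self.last_ms:
                # Refuse to issue IDs rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.seq = (self.seq + 1) & self.MAX_SEQ
                if self.seq == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self.clock_ms()
            else:
                self.seq = 0
            self.last_ms = now
            timestamp = now - self.EPOCH_MS
            return (
                (timestamp << (self.WORKER_BITS + self.SEQ_BITS))
                | (self.worker_id << self.SEQ_BITS)
                | self.seq
            )
```

Because the timestamp occupies the high bits, IDs from one worker sort roughly by creation time, which is one of the trade-offs against random UUIDs that the questions above ask about.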
📊 Service Monitoring
- What are key metrics to monitor in distributed systems?
- How do you implement health checks in microservices?
- What’s the difference between black-box and white-box monitoring?
- Explain Prometheus’s pull-based metrics collection vs push-based collection (e.g., via the Pushgateway).
- How do you design an alerting system to avoid alert fatigue?
- How does Amazon CloudWatch integrate with auto-scaling?
- What is the difference between logging, monitoring, and observability?
⚡ Distributed Caching
- Why do we need distributed caching instead of local cache?
- How does Redis Cluster behave under network partitions, and how does it fail over?
- What are cache invalidation strategies?
- Explain cache-aside vs read-through caching.
- Design a caching strategy for an e-commerce product catalog.
- How do you ensure consistency in multi-node cache clusters?
- What are common pitfalls of caching (e.g., stale data, cache stampede)?
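For the cache-aside question above, a minimal sketch may help frame an answer. It assumes an in-process dictionary standing in for the cache and a caller-supplied `load_from_db` function standing in for the database; both names, and the `CacheAside` class itself, are hypothetical.

```python
import time


class CacheAside:
    """Illustrative cache-aside wrapper with a TTL (hypothetical class name)."""

    def __init__(self, load_from_db, ttl_seconds, clock=time.monotonic):
        self._load = load_from_db      # assumed loader for the source of truth
        self._ttl = ttl_seconds
        self._clock = clock            # injectable for deterministic tests
        self._entries = {}             # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]            # cache hit
        # Cache miss or expired entry: the application, not the cache,
        # reads the source of truth and then populates the cache.
        value = self._load(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Call after a write to the source of truth to limit stale reads.
        self._entries.pop(key, None)
```

In a read-through design the loading logic would live inside the cache layer instead of the application. This sketch does nothing to prevent a stampede when many callers miss the same key at once.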
📩 Distributed Messaging Queue
- Compare Kafka vs RabbitMQ vs AWS SQS.
- How does Kafka ensure message durability?
- What is the difference between at-most-once, at-least-once, and exactly-once delivery?
- How does partitioning work in Kafka?
- Design a messaging system for order processing in e-commerce.
- What are dead-letter queues (DLQs) and when do you use them?
- How do you handle backpressure in message queues?
📢 Publish-Subscribe System
- How does pub-sub differ from point-to-point messaging?
- Explain fan-out in pub-sub systems.
- What challenges arise in maintaining message ordering in pub-sub?
- How does Google Pub/Sub scale globally?
- Design an event-driven architecture using SNS + SQS + Lambda.
- What’s the role of Kafka topics and consumer groups in pub-sub?
🚦 Rate Limiter
- Why do we need rate limiting in APIs?
- Compare token bucket vs leaky bucket algorithms.
- How does Amazon API Gateway implement rate limiting (throttling)?
- Design a rate limiter for login attempts in a banking app.
- How do you prevent abuse when scaling rate limiters across multiple servers?
- What’s the role of Redis in implementing distributed rate limiting?
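For the token bucket question above, the following is a minimal single-process sketch. The `TokenBucket` class and its `capacity`, `refill_rate`, and `clock` parameters are illustrative names. A distributed limiter would keep this state in a shared store such as Redis rather than in local memory.

```python
import time


class TokenBucket:
    """Illustrative single-process token bucket (hypothetical class name)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity          # maximum burst size, in tokens
        self.refill_rate = refill_rate    # tokens added per second
        self.clock = clock                # injectable for deterministic tests
        self.tokens = float(capacity)     # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill lazily based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A token bucket permits bursts up to `capacity`, whereas a leaky bucket drains at a fixed rate and smooths bursts out.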
🔍 Distributed Search
- How does Elasticsearch index data?
- What’s the difference between inverted index and forward index?
- How does distributed search handle sharding and replication?
- Explain full-text search vs keyword search.
- Design a search engine for an e-commerce platform.
- What are common scaling challenges with Elasticsearch?
- How do you implement autocomplete efficiently in search?
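For the inverted index question above, a toy sketch can make the idea concrete. It assumes whitespace tokenization and AND semantics across query terms; `build_inverted_index` and `search` are hypothetical helper names, and the sketch does not reflect how Elasticsearch stores its indexes internally.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A forward index would store the opposite mapping, from each document ID to its list of terms.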
📝 Distributed Logging
- Why is centralized logging important in microservices?
- How do the components of the ELK stack (Elasticsearch, Logstash, Kibana) work together?
- What’s the difference between structured and unstructured logs?
- How do you design a log aggregation system for 1M+ events/sec?
- What are common challenges in log storage and retention?
- How does CloudWatch Logs differ from Splunk?
🕵️ Distributed Tracing
- What problem does distributed tracing solve?
- Explain how AWS X-Ray works.
- What are spans and traces in OpenTelemetry?
- How do you trace a request across multiple microservices?
- Design a tracing setup for a large-scale payments platform.
- What are challenges in tracing asynchronous calls?
⏳ Distributed Task Scheduling
- What’s the difference between cron jobs and distributed schedulers?
- How does Apache Airflow manage DAG dependencies?
- How do AWS Step Functions handle retries and error handling, and how would you implement rollbacks (compensating steps)?
- Design a scheduling system for nightly ETL jobs processing TBs of data.
- What challenges arise in distributed task scheduling across regions?
- How do you handle idempotency in scheduled tasks?
🗂️ Blob Storage
- What’s the difference between object storage and block storage?
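- How does S3 achieve 11 nines (99.999999999%) of durability?
- Explain S3’s consistency model: strong read-after-write consistency (since December 2020) vs the earlier eventual consistency for overwrites and deletes.
- How do you secure access to blob storage in the cloud?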
- Design a storage strategy for a video streaming platform.
- What’s the difference between hot, warm, and cold storage tiers?
- How do you handle data lifecycle management in blob storage?