🏆 Distributed Systems & Cloud Building Blocks Interview Questions Master List


Table of Contents
  1. Domain Name System (DNS)
  2. Load Balancer
  3. Database
  4. Key-Value Store
  5. Content Delivery Network (CDN)
  6. Sequencer
  7. Service Monitoring
  8. Distributed Caching
  9. Distributed Messaging Queue
  10. Publish-Subscribe System
  11. Rate Limiter
  12. Distributed Search
  13. Distributed Logging
  14. Distributed Tracing
  15. Distributed Task Scheduling
  16. Blob Storage

🌐 Domain Name System (DNS)
  1. Explain how DNS resolution works step by step.
  2. What is the difference between recursive and iterative DNS queries?
  3. What are common DNS record types (A, AAAA, CNAME, MX, TXT, SRV)?
  4. How do DNS caching and TTL (Time-to-Live) impact performance?
  5. How does Amazon Route 53 provide high availability and latency-based routing?
  6. What is DNS poisoning/spoofing? How do you prevent it?
  7. Explain how CDN providers integrate with DNS for load balancing.
  8. Design a fault-tolerant DNS resolution system for a global e-commerce app.

⚖️ Load Balancer
  1. What’s the difference between Layer 4 and Layer 7 load balancers?
  2. How does AWS ELB detect unhealthy instances and reroute traffic?
  3. Explain sticky sessions and when you’d use them.
  4. How would you design a load balancing strategy for a real-time chat system?
  5. What’s the role of reverse proxies like Nginx in load balancing?
  6. How do load balancers handle SSL/TLS termination?
  7. Explain how global load balancing differs from regional load balancing.
  8. How do you design a load balancer setup that supports disaster recovery?

🗄️ Database
  1. Compare SQL vs NoSQL databases and when to use each.
  2. How does database replication improve availability?
  3. What is the difference between master-slave and master-master replication?
  4. How do sharding and partitioning differ?
  5. Explain CAP theorem and its application in database systems.
  6. How does DynamoDB achieve horizontal scalability?
  7. Design a schema for a high-volume transaction system (like UPI).
  8. How do you handle database migration in production without downtime?

🔑 Key-Value Store
  1. How do Redis and DynamoDB differ in design and use cases?
  2. What are typical eviction policies in Redis?
  3. How do you ensure consistency in a distributed key-value store?
  4. Explain write-through, write-back, and write-around caching strategies.
  5. Design a leaderboard system using Redis sorted sets.
  6. How do you prevent cache stampede and thundering herd problems?
  7. What’s the difference between strong and eventual consistency in key-value stores?
  8. How do you secure Redis or DynamoDB in production?

🚀 Content Delivery Network (CDN)
  1. How does a CDN reduce latency for end users?
  2. What’s the difference between push CDN and pull CDN?
  3. How do you handle cache invalidation in a CDN?
  4. Explain edge caching and its importance in video streaming.
  5. How does Cloudflare handle DDoS protection via CDN?
  6. Design a global CDN strategy for an OTT video platform.
  7. What are the challenges in serving dynamic vs static content via a CDN?

🔢 Sequencer
  1. Why do distributed systems need a sequencer?
  2. Explain Twitter’s Snowflake ID generator design.
  3. What are trade-offs between UUID and Snowflake IDs?
  4. How do you design a sequencer to avoid collisions in a high-scale system?
  5. How do sequencers preserve causality in event processing?
  6. What are scalability challenges in a centralized sequencer?
  7. How do you implement distributed ID generation in DynamoDB?
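
For the Snowflake questions above, a minimal single-process sketch may help frame an answer. It assumes the commonly described 64-bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence); the class name, the injectable clock, and the epoch constant are illustrative choices, not Twitter's actual code.

```python
import threading
import time

# Assumed bit layout: 41 bits of ms since a custom epoch | 10 bits worker | 12 bits sequence.
EPOCH_MS = 1288834974657  # assumption: the custom epoch commonly cited for Snowflake
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Hypothetical generator: IDs are unique per worker and roughly time-ordered."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock          # returns the current time in milliseconds
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Reusing an older timestamp could repeat an ID, so refuse.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Collisions are avoided only if every running generator has a distinct worker_id; how those IDs get assigned (static config, a coordination service, and so on) is left open here.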

📊 Service Monitoring
  1. What are key metrics to monitor in distributed systems?
  2. How do you implement health checks in microservices?
  3. What’s the difference between black-box and white-box monitoring?
  4. Explain pull-based vs push-based metrics collection, and where Prometheus (pull, with the Pushgateway for short-lived jobs) fits.
  5. How do you design an alerting system to avoid alert fatigue?
  6. How does AWS CloudWatch integrate with auto-scaling?
  7. What is the difference between logging, monitoring, and observability?

⚡ Distributed Caching
  1. Why do we need distributed caching instead of a local cache?
  2. How does Redis Cluster behave under network partitions and node failures?
  3. What are cache invalidation strategies?
  4. Explain cache-aside vs read-through caching.
  5. Design a caching strategy for an e-commerce product catalog.
  6. How do you ensure consistency in multi-node cache clusters?
  7. What are common pitfalls of caching (e.g., stale data, cache stampede)?
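
As a starting point for the cache-aside and invalidation questions above, here is a rough in-process sketch. It assumes a hypothetical load_from_db callable standing in for the real data source and a plain dict standing in for the cache cluster; the TTL bounds how long stale data can be served.

```python
import time


class CacheAside:
    """Hypothetical cache-aside wrapper: the caller checks the cache first,
    falls back to the data source on a miss, then populates the cache."""

    def __init__(self, load_from_db, ttl_seconds, clock=time.monotonic):
        self._load = load_from_db    # assumed callable: key -> value
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}             # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]          # cache hit, still fresh
        value = self._load(key)      # miss or expired: read the source of truth
        self._cache[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Call after a write to the data source so the next read reloads.
        self._cache.pop(key, None)
```

This sketch does nothing about cache stampedes: many concurrent misses on the same key would each hit the data source. Request coalescing or a per-key lock is one way to address that.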

📩 Distributed Messaging Queue
  1. Compare Kafka vs RabbitMQ vs AWS SQS.
  2. How does Kafka ensure message durability?
  3. What is the difference between at-most-once, at-least-once, and exactly-once delivery?
  4. How does partitioning work in Kafka?
  5. Design a messaging system for order processing in e-commerce.
  6. What are dead-letter queues (DLQs) and when do you use them?
  7. How do you handle backpressure in message queues?

📢 Publish-Subscribe System
  1. How does pub-sub differ from point-to-point messaging?
  2. Explain fan-out in pub-sub systems.
  3. What challenges arise in maintaining message ordering in pub-sub?
  4. How does Google Pub/Sub scale globally?
  5. Design an event-driven architecture using SNS + SQS + Lambda.
  6. What’s the role of Kafka topics and consumer groups in pub-sub?

🚦 Rate Limiter
  1. Why do we need rate limiting in APIs?
  2. Compare token bucket vs leaky bucket algorithms.
  3. How does API Gateway in AWS implement rate limiting?
  4. Design a rate limiter for login attempts in a banking app.
  5. How do you prevent abuse when scaling rate limiters across multiple servers?
  6. What’s the role of Redis in implementing distributed rate limiting?
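
For the token bucket question above, a single-process sketch under stated assumptions: capacity and refill_rate are illustrative parameters, and the bucket state lives in memory. A multi-server deployment would need shared state (for example in Redis) with atomic updates, which this sketch does not attempt.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: allows bursts up to `capacity`,
    then a sustained rate of `refill_rate` requests per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self._clock = clock
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self, cost=1):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill lazily based on elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A leaky bucket, by contrast, drains queued requests at a fixed rate and so smooths bursts rather than permitting them.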

🔍 Distributed Search
  1. How does Elasticsearch index data?
  2. What’s the difference between an inverted index and a forward index?
  3. How does distributed search handle sharding and replication?
  4. Explain full-text search vs keyword search.
  5. Design a search engine for an e-commerce platform.
  6. What are common scaling challenges with Elasticsearch?
  7. How do you implement autocomplete efficiently in search?
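
To make the inverted index question above concrete, here is a toy sketch. The tokenizer, function names, and AND-only query semantics are simplifying assumptions and do not reflect how Elasticsearch is implemented.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Naive tokenizer: lowercase alphanumeric runs, no stemming or stop words.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """docs maps doc_id -> text (that mapping is itself a forward index).
    Returns term -> set of doc_ids containing the term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the doc_ids that contain every term in the query (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
    return result
```

In a distributed setup, each shard would hold an index like this for its slice of the documents, and a coordinator would merge the per-shard results.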

📝 Distributed Logging
  1. Why is centralized logging important in microservices?
  2. How do the components of the ELK stack (Elasticsearch, Logstash, Kibana) work together?
  3. What’s the difference between structured and unstructured logs?
  4. How do you design a log aggregation system for 1M+ events/sec?
  5. What are common challenges in log storage and retention?
  6. How does CloudWatch Logs differ from Splunk?

🕵️ Distributed Tracing
  1. What problem does distributed tracing solve?
  2. Explain how AWS X-Ray works.
  3. What are spans and traces in OpenTelemetry?
  4. How do you trace a request across multiple microservices?
  5. Design a tracing setup for a large-scale payments platform.
  6. What are the challenges in tracing asynchronous calls?

⏳ Distributed Task Scheduling
  1. What’s the difference between cron jobs and distributed schedulers?
  2. How does Apache Airflow manage DAG dependencies?
  3. How do AWS Step Functions handle retries and rollbacks?
  4. Design a scheduling system for nightly ETL jobs processing TBs of data.
  5. What challenges arise in distributed task scheduling across regions?
  6. How do you handle idempotency in scheduled tasks?

🗂️ Blob Storage
  1. What’s the difference between object storage and block storage?
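  2. How does S3 achieve 11 nines (99.999999999%) of durability?
  3. Explain the S3 consistency model: strong read-after-write consistency today versus the eventual consistency it previously offered for overwrites and deletes.
  4. How do you secure access to blob storage in the cloud?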
  5. Design a storage strategy for a video streaming platform.
  6. What’s the difference between hot, warm, and cold storage tiers?
  7. How do you handle data lifecycle management in blob storage?