● HLD PREP   AI SYSTEMS TRACK   ·   13 CORE TOPICS
DOCUMENT v1 · PERSONAL · SID
Preparation Dossier / System Design Interviews

High-Level Design.
For AI-native engineers.

A structured 8-week prep plan covering the 13 core HLD topics you listed — each one reframed through the lens of modern AI infrastructure: LLM inference, RAG pipelines, agentic systems, GPU autoscaling, and observability for non-deterministic workloads. Every topic has curated resources (books, blogs, YouTube, papers, docs), an AI-specific angle, and interview questions to self-test against.

13+1 Core Topics · 8wk Study Plan · 80+ Curated Resources · 40h Mock Design Time


The plan, in four phases

Treat this like a curriculum. Don't jump ahead — each phase assumes comfort with the previous. Phase 1 gives you the vocabulary; Phase 2 gives you the plumbing; Phase 3 makes it scale; Phase 4 is where you combine everything into full AI systems you can defend in an interview.

PHASE 01
Foundations
Weeks 1–2
Language & concurrency primitives, core distributed-systems vocabulary, RPC/gRPC basics. The stuff every other topic depends on.
PHASE 02
Plumbing
Weeks 3–4
Databases (partition, shard, tune), caching, load balancers, rate limiting, idempotency. The stateful machinery.
PHASE 03
Scale & Ops
Weeks 5–6
Kubernetes & GPU autoscaling, CI/CD, observability, DNS/CDN, event-driven architectures with Kafka/RabbitMQ.
PHASE 04
AI Systems
Weeks 7–8
LLM serving (vLLM, SGLang), RAG pipelines, agent infra, semantic caching, evals. Mock interviews end-to-end.
⌃ How to use this doc
For each topic: read the summary, study the resources, write down the AI-specific twist, then answer the practice questions out loud as if in an interview. The goal isn't memorization; it's being able to reason about trade-offs when the interviewer says "but what if the traffic spikes 10x?" or "what breaks when you move from one GPU to a fleet?"
01 Language skills & concurrency
FOUNDATION AI-RELEVANT

Concurrency is the first filter in HLD interviews. Expect questions on goroutines vs threads, async/await, GIL behaviour, channels, mutexes, context cancellation, and how your language's concurrency model handles backpressure. For AI systems specifically, async I/O matters enormously because LLM calls are slow (seconds) and parallel tool calls are common.

Core concepts to nail

Primitives

  • Processes vs threads vs coroutines
  • Shared memory vs message passing
  • Mutex, semaphore, RWLock, atomic ops
  • Channels (Go), Futures (Rust), Promises (JS)
  • Actor model (Erlang, Akka)
  • Context cancellation, deadlines, timeouts

Traps

  • Deadlock, livelock, starvation
  • Race conditions, memory visibility, reordering
  • Thread pool exhaustion under slow I/O
  • Python GIL — when it hurts, when it doesn't
  • Async-over-sync contamination ("coloured functions")
  • Goroutine leaks from un-cancelled contexts
⌃ AI Angle
Your LangGraph agents fan out to multiple tools in parallel. You need asyncio.gather with per-task timeouts, bounded semaphores to cap concurrent LLM calls (so you don't blow your rate limit), and proper cancellation so a slow tool doesn't hold up the orchestrator. In Go, this is errgroup + context.WithTimeout. Know this pattern cold; it comes up in every agent system design.
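The fan-out pattern described here can be sketched in a few lines of asyncio. This is a minimal illustration, not a LangGraph API: call_tool, bounded_call and fan_out are hypothetical names, and the sleep stands in for a real LLM or tool call.

```python
import asyncio

async def call_tool(name: str, delay: float) -> str:
    """Hypothetical stand-in for a slow LLM or tool call."""
    await asyncio.sleep(delay)
    return f"{name}: ok"

async def bounded_call(sem: asyncio.Semaphore, name: str, delay: float, timeout: float) -> str:
    # The semaphore caps in-flight calls; wait_for cancels the call once the timeout expires.
    async with sem:
        return await asyncio.wait_for(call_tool(name, delay), timeout)

async def fan_out(tools: dict, max_concurrent: int = 2, timeout: float = 0.1) -> dict:
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [bounded_call(sem, name, delay, timeout) for name, delay in tools.items()]
    # return_exceptions=True aggregates failures instead of aborting the whole batch.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return dict(zip(tools, results))

# Two fast tools succeed; the slow one is cancelled and reported as a timeout.
results = asyncio.run(fan_out({"search": 0.01, "sql": 0.02, "slow_api": 1.0}))
```

Note that the timeout here covers only the call itself, not the time spent waiting for a semaphore slot; whether queueing time should count against the deadline is a design choice worth stating in an interview.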

Resources

  • Book: Concurrency in Go, by Katherine Cox-Buday (O'Reilly)
  • Book: Designing Data-Intensive Applications, Ch. 7 & 8, by Martin Kleppmann
  • Video: Rob Pike, Concurrency Is Not Parallelism (YouTube)
  • Blog: What Color Is Your Function?, by Bob Nystrom (on async)
  • Docs: Python asyncio patterns & pitfalls
  • Hands-on: build a concurrent LLM fan-out with timeouts + semaphore + error aggregation
02 CI/CD: depth for AI deployments
FOUNDATION AI-RELEVANT

Classic CI/CD is well-trodden. The AI twist is: what do you test when the output is probabilistic? Canary by eval score, shadow deployments, prompt versioning, model rollbacks, and feature-flagged prompts are all fair game.

Standard pillars

CI concepts

  • Build, test, lint, SAST, SCA stages
  • Matrix builds, caching, artefact stores
  • Trunk-based vs GitFlow
  • PR-preview environments

CD strategies

  • Blue/green, canary, rolling, recreate
  • Feature flags (LaunchDarkly, Unleash)
  • GitOps with ArgoCD / Flux
  • Progressive delivery (Flagger, Argo Rollouts)
⌃ AI Angle: what's different
A prompt change is a deploy. A model version bump is a deploy. Fine-tuning an adapter is a deploy. You need: (1) an eval suite that runs in CI (golden datasets, LLM-as-judge, pairwise prefs) and gates promotion; (2) shadow traffic to compare the new prompt/model against prod without affecting users; (3) canary by percentage with auto-rollback on quality regression, not just on 5xx rate. This is the modern MLOps/LLMOps delivery loop.
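A promotion gate of the kind listed in point (1) can be as small as a score comparison. The sketch below assumes per-example eval scores in [0, 1] for the same golden dataset; the function name gate and the 0.02 tolerance are illustrative choices, not part of any particular eval framework.

```python
from statistics import mean

def gate(baseline: list, candidate: list, max_drop: float = 0.02) -> bool:
    """Return True if the candidate prompt/model may be promoted."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("score lists must be non-empty and aligned per example")
    # Allow a small tolerance so eval noise alone does not block a release.
    return mean(candidate) >= mean(baseline) - max_drop

baseline_scores = [0.90, 0.80, 1.00, 0.70]   # mean 0.85
candidate_scores = [0.90, 0.70, 0.90, 0.70]  # mean 0.80, a 0.05 drop
promote = gate(baseline_scores, candidate_scores)
```

In a CI job the boolean would typically become the process exit code, so a regression fails the pipeline and blocks the merge.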

Resources

  • Book: Continuous Delivery, by Humble & Farley (the canonical text)
  • Blog: Google SRE Workbook, chapter on canarying releases
  • Blog: Martin Fowler, Continuous Delivery for ML (CD4ML)
  • Blog: Chip Huyen, CI/CD for machine learning
  • Video: ArgoCD GitOps in 100 seconds & full walkthrough, TechWorld with Nana (YouTube)
  • Docs: Argo Rollouts, analysis-based promotion
  • Hands-on: build a GitHub Actions pipeline that runs LLM evals on PRs & blocks merge on regression
03 Observability: setup, query, visualise
FOUNDATION AI-RELEVANT HIGH-YIELD

The three pillars: metrics, logs, traces. Interviewers want to hear you say Prometheus + Grafana for metrics, OpenTelemetry for traces, Loki / ELK for logs. For AI: you also need per-request token counts, TTFT, p95 generation latency, KV-cache utilisation, and quality metrics (hallucination rate, groundedness). Lookover's whole thesis sits here.

Core stack to know cold

Pillar | Tool | What to practise
Metrics | Prometheus + Grafana | Write PromQL: rate(), histogram_quantile, recording rules, alerts
Traces | OpenTelemetry + Jaeger/Tempo | Propagate trace context across async boundaries, span attributes
Logs | Loki or ELK | Structured JSON logs, correlation IDs, log-to-trace linking
Profiling | Pyroscope / pprof | Continuous profiling in production
LLM-specific | Langfuse, LangSmith, Helicone | Prompt/response tracing, cost, eval runs
⌃ AI Angle: the metrics that matter
For LLM inference: TTFT (time-to-first-token), tokens/sec, queue duration, KV cache utilisation, prompt tokens, completion tokens, $/request. vLLM and TGI expose most of these as Prometheus metrics natively. Your Grafana dashboard should answer: "Is latency degrading because the queue is growing, or because prompts are getting longer?"
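# Example PromQL for LLM serving
# (vLLM prefixes its metric names with "vllm:" and labels them with model_name)
# p95 TTFT by model
histogram_quantile(0.95,
  sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
)

# GPU memory pressure (%). The default dcgm-exporter counter set typically ships
# FB_USED and FB_FREE; FB_TOTAL must be enabled in the counters file before use.
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100

# Queue depth (scale trigger)
sum by (model_name) (vllm:num_requests_waiting)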

Resources

  • Book: Observability Engineering, by Majors, Fong-Jones, Miranda (Honeycomb; O'Reilly)
  • Blog: Zerodha, Monitoring stack with Prometheus, Grafana, VictoriaMetrics
  • Blog: Monitor LLM inference with Prometheus & Grafana (vLLM, TGI, llama.cpp)
  • Video: Prometheus + Grafana crash course, TechWorld with Nana (YouTube)
  • Docs: OpenTelemetry, concepts & semantic conventions for GenAI
  • Docs: Langfuse, open-source LLM observability
  • Hands-on: build vLLM + Prometheus + Grafana locally; design a dashboard for one model
04 Databases: partitioning, sharding, tuning
FOUNDATION AI-RELEVANT HIGH-YIELD

This is a multi-topic, high-yield area. You need to distinguish partitioning (a logical split within one DB) from sharding (a horizontal split across nodes), speak fluently about tuning, and, for AI systems specifically, know vector databases, hybrid search, and why the choice between pgvector-on-Postgres and a dedicated vector DB matters.

4.1 Partitioning

Strategies

  • Range (date-based — most common)
  • List (discrete values: region, tenant)
  • Hash (uniform distribution)
  • Composite (range + hash)

Wins

  • Partition pruning — query only relevant chunks
  • Parallel query plans
  • Drop old partitions for archival (fast!)
  • Index size stays manageable

4.2 Sharding

Key decisions: shard key (avoid hotspots), resharding strategy (consistent hashing vs range splits), cross-shard queries (scatter-gather or avoid). Know Vitess (MySQL), Citus (Postgres), MongoDB native sharding, and how Discord partitions and migrated its message store.
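To make the resharding trade-off concrete, here is a minimal consistent-hashing ring with virtual nodes. It is a teaching sketch under stated assumptions (MD5 as the hash, 64 virtual nodes per shard, hypothetical shard names), not how Vitess, Citus or MongoDB implement placement.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # First 8 bytes of MD5 as an integer; any stable, well-mixed hash would do.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, shards, vnodes: int = 64):
        # Each shard owns several points on the ring to even out the key distribution.
        self._ring = sorted((_hash(f"{s}#{i}"), s) for s in shards for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash (wrapping around).
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user-{i}" for i in range(1000)]
before = HashRing(["shard-a", "shard-b", "shard-c"])
after = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
moved_keys = [k for k in keys if before.shard_for(k) != after.shard_for(k)]
```

Adding shard-d only moves keys onto shard-d, roughly a quarter of them, whereas a plain `hash(key) % n` scheme would remap most keys when n changes.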

4.3 Tuning — the Zerodha playbook

You asked specifically for the Zerodha Postgres blog. Here it is, plus more.

⌃ AI Angle: vector & hybrid search
RAG systems need vector search. Know the trade-offs: pgvector (a Postgres extension that keeps metadata JOINs easy) vs dedicated engines (Pinecone, Weaviate, Qdrant, Milvus). Understand IVFFlat vs HNSW indexes: IVFFlat builds faster and uses less memory, while HNSW usually gives a better speed/recall trade-off at query time. Understand hybrid search (BM25 + vector, fused with Reciprocal Rank Fusion). For most startup-scale RAG, pgvector on Postgres is the right answer, and that's a strong, defensible interview take.
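Reciprocal Rank Fusion is simple enough to write out: each document scores the sum of 1 / (k + rank) over the result lists it appears in. The sketch below uses k = 60, the constant commonly used in the literature; the document IDs are made up for illustration.

```python
from collections import defaultdict

def rrf(rankings, k: int = 60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["d1", "d2", "d3"]    # keyword search results, best first
vector_hits = ["d3", "d1", "d4"]  # embedding search results, best first
fused = rrf([bm25_hits, vector_hits])
```

Because RRF uses ranks rather than raw scores, it needs no normalisation between BM25 scores and cosine similarities, which is the main reason it is the default fusion method in hybrid search.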

Resources

  • Blog: Zerodha, Scaling with common sense, by Kailash Nadh
  • Blog: Zerodha, Working with PostgreSQL (the definitive tuning post)
  • Blog: Zerodha, 7M Postgres tables reporting hack
  • Video: Kailash Nadh, Scaling 7M+ Postgres Tables (talk; YouTube)
  • Book: Designing Data-Intensive Applications, Ch. 5 & 6, by Martin Kleppmann
  • Blog: Use the Index, Luke, the practical SQL index guide
  • Blog: Discord, How Discord Stores Trillions of Messages
  • Docs: pgvector, HNSW & IVFFlat index docs (GitHub)
  • Paper: HNSW: Efficient & robust approximate nearest neighbour search, by Malkov & Yashunin (arXiv)
  • Hands-on: partition a time-series table in pg, run EXPLAIN ANALYZE before/after
05 Kubernetes: deploy, scale, GPUs
FOUNDATION AI-RELEVANT HIGH-YIELD

Classic K8s topics: Deployments, Services, Ingress, ConfigMaps, probes, HPA. For AI: GPU scheduling, MIG (multi-instance GPU), KEDA for scale-to-zero, and DCGM metrics. This is the hottest intersection in interviews right now — GPU-aware autoscaling is a real skill gap in the market.

5.1 Deployment fundamentals

Must-know

  • Pod / Deployment / StatefulSet / DaemonSet
  • Service types: ClusterIP, NodePort, LoadBalancer
  • Ingress + ingress controllers (nginx, traefik)
  • Liveness, readiness, startup probes
  • Resource requests vs limits, QoS classes
  • RollingUpdate strategy, maxSurge, maxUnavailable

Production concerns

  • Pod Disruption Budgets (PDBs)
  • PodAntiAffinity for HA
  • NetworkPolicies for zero-trust
  • ServiceAccounts + RBAC
  • Secrets (sealed-secrets or external-secrets)
  • Graceful shutdown + preStop hooks

5.2 Minikube for local dev

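For practice, minikube or kind is sufficient. If you want to touch GPUs locally, the simplest route is to run vLLM in Docker with --gpus all; exposing a GPU inside a kind or minikube cluster takes extra setup (NVIDIA container toolkit plus the NVIDIA device plugin). For production-like labs, note that GKE, EKS and AKS do not include GPU nodes in their free tiers, so budget for short-lived spot/preemptible GPU node pools.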

5.3 Autoscaling — the deep cut

Out of the box, HPA scales on CPU/memory. For AI, that's close to useless: your LLM pod is GPU-bound and queue-bound. You need HPA on custom metrics (via a Prometheus adapter) or KEDA with Prometheus-based triggers.

⌃ AI Angle: scale on queue depth, not CPU
The 2026 interview answer for "how do you scale LLM inference?" is: HPA on custom metrics (queue depth from vLLM's num_requests_waiting, TTFT, GPU utilisation via DCGM exporter), or KEDA when you also want scale-to-zero during idle windows. Combine with MIG on A100/H100 for multi-tenant isolation. For cold starts, pre-pull images and cache model weights on a PVC. Keep rough numbers in mind (they vary with storage and GPU type): 7B model cold start ≈ 30–60s with cached weights; 70B ≈ 2–5 min.
# KEDA ScaledObject — scale vLLM from 0–8 on queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 0         # scale to zero when idle
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_queue_depth
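      # vLLM exports its metrics with a "vllm:" prefix. At zero replicas no pod reports
      # a queue, so the wake-up signal has to come from a gateway or router metric.
      query: sum(vllm:num_requests_waiting)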
      threshold: "5"
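The arithmetic behind a threshold of 5 is worth being able to do by hand. The sketch below approximates how an average-value target turns queue depth into a replica count; it deliberately ignores the HPA tolerance band, stabilisation windows and KEDA's cooldown, and desired_replicas is an illustrative helper, not a KEDA function.

```python
import math

def desired_replicas(queue_depth: float, threshold: float = 5,
                     min_replicas: int = 0, max_replicas: int = 8) -> int:
    """Approximate replica count for a total queue depth and a per-replica target."""
    if queue_depth <= 0:
        return min_replicas  # idle: fall back to the floor (zero in the manifest above)
    wanted = math.ceil(queue_depth / threshold)
    # With work waiting, run at least one replica and never exceed the ceiling.
    return max(1, min_replicas, min(max_replicas, wanted))

idle = desired_replicas(0)
moderate = desired_replicas(12)   # ceil(12 / 5) = 3
overload = desired_replicas(100)  # ceil(100 / 5) = 20, capped at 8
```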

Resources

  • Book: Kubernetes in Action, by Marko Lukša (the definitive book; Manning)
  • Video: Kubernetes full course, TechWorld with Nana (free, 4+ hours; YouTube)
  • Blog: Deploying LLMs on Kubernetes: vLLM, Ray Serve & GPU scheduling (2026)
  • Blog: Autoscaling K8s GPU workloads, a complete production guide (Medium)
  • Blog: Auto-scaling GPU inference pods with KEDA + cost guards
  • Docs: KEDA, scalers catalogue (Prometheus, Kafka, HTTP, RabbitMQ…)
  • Docs: NVIDIA DCGM exporter, GPU metrics for Prometheus (GitHub)
  • Hands-on: deploy vLLM on minikube/kind with HPA on a custom metric
06 Load balancers
FOUNDATION AI-RELEVANT

L4 vs L7. Algorithms (round-robin, least-conns, weighted, consistent hashing, EWMA). Health checks. Session affinity. For AI: least-request balancing alone is rarely right for LLM servers. You need KV-cache-aware routing, because directing a conversation continuation to a worker that already has the prefix cached skips most of the prefill work and can cut time-to-first-token dramatically.

Core concepts

Types

  • L4 (TCP/UDP) — HAProxy, AWS NLB
  • L7 (HTTP) — nginx, Envoy, Traefik, AWS ALB
  • DSR (direct server return)
  • GSLB — global/geo-based (covered in DNS)
  • Service mesh sidecar LB (Envoy via Istio/Linkerd)

Algorithms

  • Round-robin, weighted round-robin
  • Least connections, least response time
  • IP hash / consistent hashing (sticky)
  • Power-of-two-choices (P2C)
  • EWMA (exponentially-weighted moving average)
⌃ AI Angle: prefix-aware routing
Modern LLM routers (vLLM Production Stack, llm-d, NVIDIA Dynamo) implement KV-cache-aware routing: hash the prompt prefix and prefer a worker whose cache already holds it. Combined with prefill/decode disaggregation (some workers do the one-shot prefill, others do token-by-token decode), this is the cutting edge of LLM load balancing. If you can explain this in an interview, you'll stand out.
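A toy version of the idea helps when whiteboarding it. The sketch below hashes a fixed-length character prefix and remembers which worker served it; real routers hash token blocks and weigh cache hits against load, so treat PrefixRouter and its parameters as illustrative assumptions rather than any project's actual algorithm.

```python
import hashlib

class PrefixRouter:
    def __init__(self, workers, prefix_chars: int = 256):
        self.load = {w: 0 for w in workers}  # requests routed to each worker so far
        self.prefix_owner = {}               # prefix hash -> worker assumed to hold its KV cache
        self.prefix_chars = prefix_chars

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt[: self.prefix_chars].encode()).hexdigest()

    def route(self, prompt: str) -> str:
        key = self._key(prompt)
        worker = self.prefix_owner.get(key)
        if worker is None:
            # Cache miss: fall back to the least-loaded worker and remember the choice.
            worker = min(self.load, key=self.load.get)
            self.prefix_owner[key] = worker
        self.load[worker] += 1
        return worker

router = PrefixRouter(["gpu-0", "gpu-1"])
system_prompt = "You are a helpful assistant. " * 20  # long shared prefix
first = router.route(system_prompt + "What is Raft?")
second = router.route(system_prompt + "What is Paxos?")
other = router.route("Translate this sentence into French.")
```

The obvious failure mode is a hot prefix overloading one worker, which is why production routers cap affinity with a load threshold.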

Resources

  • Blog: Cloudflare, Load balancing at the edge (technical deep dive)
  • Blog: Netflix, Rethinking Netflix's Edge Load Balancing
  • Video: System Design, L4 vs L7 Load Balancers, ByteByteGo (YouTube)
  • Docs: Envoy Proxy, HTTP load balancing configuration
  • Blog: The New Stack, Six frameworks for efficient LLM inferencing (covers routing)
  • Paper: The Power of Two Choices in Randomized Load Balancing, by Mitzenmacher
07 Rate limiting: leaky & token bucket
FOUNDATION AI-RELEVANT

Know all five classic algorithms: fixed window, sliding window log, sliding window counter, token bucket, leaky bucket. Know where you apply them (client, edge, service, DB). For AI systems, rate-limit by tokens, not just requests; otherwise one 100k-token prompt can starve a thousand small ones.

Algorithm cheatsheet

Algorithm | Gist | Pros / Cons
Fixed window | Counter per time bucket | Simple; burst at bucket edges
Sliding window log | Store every timestamp | Accurate; high memory
Sliding window counter | Weighted average of adjacent windows | Good balance; most common
Token bucket | Tokens refill at rate r, burst to b | Allows controlled bursts
Leaky bucket | Queue drained at constant rate | Smooths output; may add latency
⌃ AI Angle: token-based limits
OpenAI, Anthropic, Google all rate-limit on tokens per minute (TPM) in addition to requests per minute (RPM). Your proxy needs to pre-estimate token count from the prompt and reserve capacity. Design a fair-queueing scheme so a big prompt doesn't monopolise capacity: weighted fair queueing with token-cost weights. Also: implement exponential backoff with jitter for 429s from upstream providers.
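A token bucket that charges by estimated LLM tokens instead of by request can be sketched as follows. It is an in-process illustration with an injectable clock; the class name, the rates and the 4-characters-per-token estimate are assumptions for the example, and a shared limiter would keep this state in Redis.

```python
import time

class TokenBudgetBucket:
    """Token bucket whose unit of cost is LLM tokens rather than requests."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.level = capacity        # start full so an initial burst is allowed
        self.updated = clock()

    def try_acquire(self, cost: float) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.level = min(self.capacity, self.level + (now - self.updated) * self.rate)
        self.updated = now
        if cost <= self.level:
            self.level -= cost
            return True
        return False

def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    # Crude assumption: ~4 characters per token for English text, plus the output budget.
    return len(prompt) // 4 + max_output_tokens

fake_now = [0.0]  # controllable clock so the example is deterministic
bucket = TokenBudgetBucket(rate_per_sec=100, capacity=1000, clock=lambda: fake_now[0])
first_ok = bucket.try_acquire(800)       # 1000 -> 200 left
second_ok = bucket.try_acquire(800)      # only 200 available: rejected
fake_now[0] = 6.0                        # six seconds later: +600 tokens
third_ok = bucket.try_acquire(800)       # 800 available: accepted
```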

Resources

  • Blog: Stripe, Scaling your API with rate limiters
  • Blog: Figma, An alternative approach to rate limiting
  • Video: Rate Limiting Fundamentals, System Design Interview (Alex Xu/ByteByteGo; YouTube)
  • Blog: Cloudflare, How we built rate limiting capable of scaling to millions
  • Docs: OpenAI cookbook, how to handle rate limits
  • Hands-on: build a token-bucket rate limiter in Redis that rate-limits by estimated LLM tokens
08 Idempotency
FOUNDATION AI-RELEVANT

Idempotency keys, dedup windows, retries with exponential backoff + jitter, at-least-once vs exactly-once semantics. Every payment system you've built at DodoPayments lives or dies by this. For AI: agents retry tool calls constantly, and a non-idempotent "send email" tool is a disaster waiting to happen.

Key patterns

Producer side

  • Generate idempotency-key on the client
  • Include in request header (e.g. Idempotency-Key)
  • Retry with same key on failure

Consumer side

  • Dedup store (Redis, DynamoDB) with TTL
  • Unique constraint at DB level as backstop
  • Outbox pattern for transactional publishing
  • Transactional inbox for consumers
⌃ AI Angle: tool-call idempotency
When an agent calls send_email(to=X, subject=Y, body=Z) and times out, did the email send? Design your tool interface so every call takes an idempotency-key derived from the agent's thought-hash (a stable hash of the step that produced the call, so a retry reuses the same key). The orchestrator dedupes on this. This is crucial for your LangGraph agents: the DodoPayments Refund Orchestrator cannot double-refund, ever.
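One way to picture the dedup step is an executor that runs a tool at most once per key. The sketch below derives the key from run ID, step number, tool name and arguments, which is one possible reading of a "thought-hash" and an assumption of this example; IdempotentExecutor and send_email are hypothetical, and the in-memory dict stands in for Redis or a table with a unique constraint.

```python
import hashlib
import json

class IdempotentExecutor:
    def __init__(self):
        self._results = {}  # key -> stored result (a durable store with TTL in production)

    @staticmethod
    def key_for(run_id: str, step: int, tool: str, args: dict) -> str:
        # Stable key: the same run, step, tool and arguments always hash to the same value.
        payload = json.dumps([run_id, step, tool, args], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, key: str, fn, *args, **kwargs):
        if key in self._results:
            return self._results[key]  # retry: replay the stored result, skip the side effect
        result = fn(*args, **kwargs)
        self._results[key] = result
        return result

sent = []
def send_email(to: str, subject: str) -> str:
    sent.append((to, subject))  # the side effect we must not repeat
    return f"sent to {to}"

executor = IdempotentExecutor()
key = executor.key_for("run-42", 3, "send_email", {"to": "a@example.com", "subject": "Refund"})
first = executor.call(key, send_email, "a@example.com", "Refund")
retry = executor.call(key, send_email, "a@example.com", "Refund")
```

This version is not crash-safe: a failure between the side effect and the write to the store still allows a duplicate, which is why the downstream tool should also accept and enforce the key.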

Resources

  • Blog: Stripe, Designing robust and predictable APIs with idempotency
  • Blog: Brandur Leach, Implementing Stripe-like idempotency keys in Postgres
  • Blog: Microservices.io, Transactional outbox & inbox patterns
  • Video: Designing for failure: exactly-once semantics explained, Arjan Codes / ByteByteGo (YouTube)
09 Caching & invalidation
FOUNDATION AI-RELEVANT HIGH-YIELD

Read-through, write-through, write-behind, cache-aside. TTL vs LRU vs LFU eviction. Stampede (thundering herd) prevention. For AI: two massive wins — prompt caching (Anthropic & OpenAI both expose it) and semantic caching (embed the query, look up near-matches, return cached response if cosine similarity > threshold).

9.1 Strategies

Pattern | Flow | Trade-off
Cache-aside (lazy) | App checks cache → miss → load from DB → populate cache | Simple; first request slow
Read-through | Cache loads from DB on miss transparently | Client doesn't know about DB
Write-through | Write to cache & DB synchronously | Slow writes; strong consistency
Write-behind (back) | Write to cache; flush async to DB | Fast writes; risk on cache loss
Refresh-ahead | Cache proactively refreshes before TTL | Hides latency; may over-fetch

9.2 Invalidation — the hard part

The two hardest things in CS are cache invalidation, naming things, and off-by-one errors. Strategies: TTL (lazy), event-driven invalidation (publish change events), versioned keys (bump version on write), write-through (trivially consistent but slow).

⌃ AI Angle: three caches that matter
1. Prompt caching: Anthropic/OpenAI cache repeated prompt prefixes, cutting cost & latency. Use it for system prompts + long context docs.
2. KV-cache reuse: at the serving layer (vLLM's automatic prefix caching, built on PagedAttention), prefix tokens the server has already processed don't need recomputation.
3. Semantic caching: embed the user query, check the vector store for a near-match past response. Serve the cached answer only if similarity clears a high threshold (0.95 is a common starting point; tune it on your own traffic) and the answer is still fresh. Watch out: semantic cache poisoning is real.
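The semantic-cache lookup reduces to a nearest-neighbour search with two guards, similarity and freshness. Below is a minimal in-memory sketch: the two-dimensional vectors stand in for real embeddings, the linear scan stands in for a vector index, and SemanticCache with its defaults is an illustrative assumption.

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 3600.0):
        self.threshold = threshold
        self.ttl = ttl_seconds
        self._entries = []  # (embedding, response, stored_at)

    def put(self, embedding, response: str, now: float) -> None:
        self._entries.append((embedding, response, now))

    def get(self, embedding, now: float):
        best, best_sim = None, self.threshold
        for emb, response, stored_at in self._entries:
            if now - stored_at > self.ttl:
                continue  # too old: treat as a miss even if it matches
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best

cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer", now=0.0)
near_hit = cache.get([0.99, 0.05], now=10.0)    # cosine ≈ 0.999: served from cache
far_miss = cache.get([0.6, 0.8], now=10.0)      # cosine = 0.6: below threshold
stale_miss = cache.get([1.0, 0.0], now=4000.0)  # identical query, but past the TTL
```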

Resources

  • Book: Designing Data-Intensive Applications, Ch. 3, by Martin Kleppmann
  • Paper: Facebook, Scaling Memcache at Facebook (the classic paper)
  • Docs: Redis, client-side caching & invalidation
  • Docs: Anthropic, prompt caching (official guide)
  • Blog: Semantic caching in LLM pipelines, Redis blog
  • Video: Cache patterns explained, ByteByteGo (YouTube)
  • Hands-on: build a semantic cache with Redis + pgvector, measure hit rate on real prompts
10 Distributed systems
FOUNDATION AI-RELEVANT HIGH-YIELD

The broadest topic. CAP, PACELC, consistency models, consensus (Raft, Paxos), leader election, replication, 2PC/3PC/sagas, vector clocks, CRDTs. This is the "speak the language fluently" topic — you won't be asked to implement Raft, but you must reason about what breaks when a network partitions during your RAG write.

Mental models to own

Foundational

  • CAP & PACELC theorem
  • Consistency: linearizable, sequential, causal, eventual
  • Consensus: Raft, Paxos, ZAB (roughly, when to use)
  • Leader election vs leaderless (Dynamo-style)
  • Quorum reads/writes (W + R > N)

Practical

  • Sagas (orchestration vs choreography)
  • Two-phase commit & why it's rare
  • Outbox pattern & change data capture (CDC)
  • Event ordering & clock skew (Lamport timestamps, vector clocks)
  • Partition tolerance strategies: retries, hedging, fallback
⌃ AI Angle: your agents are distributed systems
A multi-step LangGraph agent running 4 tool calls across 3 external APIs is a distributed system. You'll be asked: what happens if tool call 3 of 4 succeeds but the agent crashes before committing state? Answer: durable execution (Temporal, Restate, AWS Step Functions) or your own checkpointing. For multi-agent systems, you have a consensus problem: which agent's answer wins? Know this vocabulary.
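The "your own checkpointing" option can be shown in miniature: record each step's result as it completes and skip recorded steps on restart. This is a sketch, not how Temporal or Restate work internally; run_with_checkpoints and the step names are hypothetical, and the dict stands in for a durable store.

```python
def run_with_checkpoints(steps, checkpoint: dict) -> dict:
    """Run (name, fn) steps in order, skipping any step already recorded in checkpoint."""
    for name, fn in steps:
        if name in checkpoint:
            continue              # completed before the crash: do not repeat it
        checkpoint[name] = fn()   # record the result as soon as the step finishes
    return checkpoint

calls = []  # records every real execution so we can see what was repeated

def step(name: str, fail: bool = False):
    def run():
        calls.append(name)
        if fail:
            raise RuntimeError(f"{name} crashed")
        return f"{name}-done"
    return run

store = {}  # assumed durable across restarts
try:
    # First attempt: the third step crashes after two steps have been checkpointed.
    run_with_checkpoints(
        [("fetch", step("fetch")), ("refund", step("refund")), ("notify", step("notify", fail=True))],
        store,
    )
except RuntimeError:
    pass
# Restart: only the step that never completed runs again.
run_with_checkpoints(
    [("fetch", step("fetch")), ("refund", step("refund")), ("notify", step("notify"))],
    store,
)
```

A crash between a step's side effect and the checkpoint write still replays that step, so this gives at-least-once execution and the tools themselves must be idempotent.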

Resources

  • Book: Designing Data-Intensive Applications, by Martin Kleppmann (the entire book)
  • Book: Understanding Distributed Systems, by Roberto Vitillo
  • Video: MIT 6.824 Distributed Systems lectures, by Robert Morris (free on YouTube)
  • Paper: In Search of an Understandable Consensus Algorithm (Raft)
  • Paper: Dynamo: Amazon's Highly Available Key-value Store
  • Blog: Jepsen, consistency analyses (the gold standard)
  • Docs: Temporal, durable execution for agents & workflows
11 Event-driven architectures: Kafka & RabbitMQ
FOUNDATION AI-RELEVANT

Kafka and RabbitMQ solve different problems. Kafka is a distributed log — durable, replayable, high throughput, good for streams and CDC. RabbitMQ is a message broker — flexible routing, lower throughput, good for task queues. For AI workloads: Kafka for ingestion & evaluation streams, RabbitMQ (or SQS) for inference task queues.

Core differences

Aspect | Kafka | RabbitMQ
Model | Distributed log (consumer pull) | Broker (push, routing)
Ordering | Per-partition | Per-queue (with caveats)
Throughput | 100k–1M+ msg/s | 10k–50k msg/s
Retention | Days/weeks/forever | Until consumed
Replay | Yes, native | No (dead-letter workaround)
Best fit | Event streaming, CDC, analytics, audit logs | Task queues, work distribution, RPC-ish
⌃ AI Angle: where each fits
Kafka: stream all LLM request/response pairs for offline evaluation & fine-tuning dataset building. Use Kafka Streams or Flink for real-time drift detection on embeddings. RabbitMQ/SQS: async document ingestion for RAG (user uploads PDF → queue → worker chunks + embeds + stores), and long-running inference jobs (image gen, batch transcription). KEDA can scale both Kafka and RabbitMQ consumers natively.

Resources

  • Book: Kafka: The Definitive Guide, by Shapira et al. (O'Reilly)
  • Blog: Confluent, Kafka fundamentals & design patterns
  • Video: Apache Kafka in 6 minutes + deep dives, ByteByteGo (YouTube)
  • Blog: RabbitMQ vs Kafka, when to use which, by Jack Vanlightly
  • Docs: RabbitMQ, work queues tutorial
  • Blog: Uber, Real-time data infrastructure with Kafka
  • Hands-on: build a RAG ingestion pipeline: upload → RabbitMQ → worker → pgvector
12 DNS & CDN
FOUNDATION AI-RELEVANT

DNS: recursive resolvers, authoritative, TTL, DNS-based load balancing (GeoDNS), anycast. CDNs: edge vs origin, cache-control, origin shield, signed URLs, Workers/edge functions. For AI: edge inference is becoming real (Cloudflare Workers AI, Vercel AI SDK on edge). Know it.

Essentials

DNS

  • Record types (A, AAAA, CNAME, MX, TXT, SRV)
  • Recursive vs iterative resolution
  • TTL trade-offs (low = flexibility, high = resilience)
  • GeoDNS / latency-based routing (Route53, NS1)
  • Anycast for global presence

CDN

  • Cache-Control, Surrogate-Control, ETag
  • Origin shield, tiered caching
  • Stale-while-revalidate, stale-if-error
  • Signed URLs for private assets
  • Edge functions (Cloudflare Workers, CloudFront Functions)
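⌃ AI Angle: edge inference & regional serving
Inference is moving closer to users: Cloudflare Workers AI runs models on Cloudflare's edge network, while gateways such as Vercel AI Gateway and regional endpoints on AWS Bedrock route requests to provider-hosted models. For global apps, route users to the closest model region via latency-based DNS. Cache embeddings at the edge (they're small, static, cacheable). Cache model responses only with care: responses to authenticated requests must be keyed per user (for example with Vary: Authorization) so one user's answer is never served to another. These are all high-signal details in a senior interview.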

Resources

  • Blog: Julia Evans, DNS explained in depth (zines + blog)
  • Blog: Cloudflare Learning Center, DNS, CDN, anycast (free, excellent)
  • Blog: High Scalability, How CDNs work at scale
  • Docs: Cloudflare Workers AI, LLMs at the edge
  • Video: CDN Design, ByteByteGo (YouTube)
13 RPC & gRPC
FOUNDATION AI-RELEVANT

REST vs gRPC vs GraphQL vs tRPC. Know when gRPC wins: internal service-to-service, streaming, strict typing, polyglot. Protobuf schema evolution. Interceptors, deadlines, metadata. For AI: server-sent events (SSE) and gRPC streams for token streaming; Model Context Protocol (MCP) is the new standard for tool calls.

REST vs gRPC — when to use what

Criterion | REST | gRPC
Transport | HTTP/1.1 or /2, JSON | HTTP/2, Protobuf (binary)
Typing | OpenAPI (optional) | Strict, via .proto
Streaming | SSE or WebSockets | Native bi-di streams
Browser | First class | Needs gRPC-Web proxy
Best fit | Public APIs, browser clients | Internal microservices, low-latency
⌃ AI Angle: streaming protocols for LLMs
Token-by-token streaming is non-negotiable for UX. Three options: SSE (simple, HTTP-compatible, browser-friendly; the OpenAI/Anthropic default), WebSockets (bi-directional, good for voice/interrupt), gRPC streaming (internal service mesh). For tool-calling, MCP (Model Context Protocol) is Anthropic's open standard. It's worth reading the spec: it's essentially a structured RPC layer for LLM tools.
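SSE framing is simple enough to show on the wire: each event is a "data: ..." line followed by a blank line. The sketch below serialises tokens as JSON events and parses them back; the {"token": ...} payload shape is an assumption for illustration, and the [DONE] sentinel follows the convention used by OpenAI-style streaming APIs.

```python
import json

def sse_events(tokens):
    """Yield one SSE frame per token, then a terminating sentinel frame."""
    for tok in tokens:
        payload = json.dumps({"token": tok})
        yield f"data: {payload}\n\n"  # a blank line terminates each event
    yield "data: [DONE]\n\n"

def parse_sse(stream: str):
    """Recover the token list from a concatenated SSE stream."""
    out = []
    for frame in stream.split("\n\n"):
        if not frame.startswith("data: "):
            continue  # skip empty trailing chunks
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            break
        out.append(json.loads(payload)["token"])
    return out

wire = "".join(sse_events(["Hel", "lo", " world"]))
tokens_back = parse_sse(wire)
```

A real server would flush each frame as it is generated, with Content-Type: text/event-stream, rather than joining them into one string.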

Resources

  • Book: gRPC: Up and Running, by Kasun Indrasiri (O'Reilly)
  • Video: gRPC vs REST, which one should you use?, ByteByteGo (YouTube)
  • Docs: gRPC official, concepts, streaming, interceptors
  • Docs: Model Context Protocol (MCP), Anthropic spec
  • Blog: Netflix, gRPC at Netflix (service mesh + observability)
14 AI add-on: serving, RAG, agents (must cover)
AI-NATIVE CRITICAL

You said "tweak the plan around AI" — this entire section is the tweak. GenAI/LLM system design is now a standalone interview category at OpenAI, Anthropic, Google, Meta, and every startup hiring AI engineers. Three sub-topics: LLM serving, RAG, and agents.

14.1 LLM Inference & Serving

Know the landscape: vLLM (open-source, PagedAttention, high throughput), TensorRT-LLM (NVIDIA, typically the best performance on NVIDIA hardware), Hugging Face TGI (ecosystem integration), SGLang (structured generation, prefix caching via RadixAttention), llama.cpp / Ollama (local & edge). Key concepts: continuous batching, PagedAttention, speculative decoding, tensor parallelism, prefill/decode disaggregation.

14.2 RAG — Retrieval-Augmented Generation

End-to-end pipeline: parse → chunk → embed → store → retrieve → rerank → prompt → generate. For each stage, know 2–3 options and their trade-offs. Chunking strategy is where most RAG systems die — semantic chunking beats fixed-size for most document types but costs more. Always implement hybrid search (BM25 + vector) + reranking.
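As a baseline for the chunking stage, fixed-size chunking with overlap looks like the sketch below. It splits on whitespace rather than model tokens, and the sizes are arbitrary defaults chosen for the example; semantic chunking would replace the fixed window with boundaries derived from the document's structure or embeddings.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

doc = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(doc, size=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding some text twice.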

14.3 Agents & Tool-Use

An agent is an LLM with an execution loop, tools, memory, and guardrails. The LLM is ~20% of the system — the rest is infrastructure: orchestrator (LangGraph, custom), tool registry, sandbox, policy engine, observability. This is squarely in Lookover's wheelhouse — and your Claude dossier on the LangGraph automation stack reflects exactly the right mental model.

⌃ The must-know system designs
1. "Design an LLM chatbot with RAG over 10M docs."
2. "Design the inference serving layer for a popular open-source LLM."
3. "Design an AI agent that can take actions on behalf of users safely."
4. "Design an eval & observability platform for production LLM apps." (literally Lookover)
5. "Design a semantic cache for an LLM API proxy."

You should be able to whiteboard each of these, naming components, trade-offs, and failure modes.

Resources — essential

  • Book: Designing Machine Learning Systems, by Chip Huyen (O'Reilly, 2022)
  • Book: AI Engineering: Building Applications with Foundation Models, by Chip Huyen (O'Reilly, 2024)
  • Blog: Chip Huyen, huyenchip.com (entire blog)
  • Blog: Eugene Yan, Patterns for building LLM-based systems & products
  • Blog: Lilian Weng, LLM Powered Autonomous Agents (the canonical post)
  • Blog: Anthropic, Building effective agents
  • Blog: Generative AI System Design Interview Guide 2026, PracHub
  • Blog: IGotAnOffer, GenAI system design interview (examples & framework)
  • Blog: Agentic AI System Design Interview Guide 2026 (Medium)
  • Docs: vLLM documentation: PagedAttention, continuous batching, serving
  • Blog: Best LLM Inference Engines 2026: vLLM, TensorRT-LLM, TGI, SGLang
  • Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention, by Kwon et al. (the vLLM paper; arXiv)
  • Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, by Lewis et al. (arXiv)
  • Paper: ReAct: Synergizing Reasoning and Acting in Language Models (arXiv)
  • Video: Andrej Karpathy, Let's build a GPT + Deep Dive into LLMs (YouTube)
  • Video: Full-stack LLM Bootcamp, by Charles Frye et al. (free; FSDL)
  • Hands-on: build an end-to-end RAG on your Lookover compliance docs and measure precision@k

The 8-week schedule

Aim for ~10 hours a week: 5 hours reading/watching, 3 hours hands-on building, 2 hours mock interviewing (out loud, whiteboard on paper). If 10h/week is too much, stretch to 12 weeks — don't compress below 8.

WEEK 01 · FOUNDATIONS
Language & concurrency
  • Read DDIA ch. 7–8
  • Watch Rob Pike concurrency
  • Build: concurrent LLM fan-out
  • Mock: "design a URL shortener"
WEEK 02 · FOUNDATIONS
Distributed systems + RPC
  • Start MIT 6.824 lectures 1–4
  • Read CAP, PACELC, Dynamo paper
  • gRPC tutorial + MCP spec
  • Mock: "design a chat system"
WEEK 03 · PLUMBING
Databases deep-dive
  • Zerodha Postgres blog trio
  • DDIA ch. 5–6 (replication, partitioning)
  • pgvector hands-on
  • Mock: "design news feed storage"
WEEK 04 · PLUMBING
Caching, LB, rate limits, idempotency
  • Stripe rate limits & idempotency posts
  • Facebook memcache paper
  • Build: semantic cache prototype
  • Mock: "design payment processor"
WEEK 05 · SCALE
Kubernetes & GPU autoscaling
  • Nana K8s crash course
  • PreMAI LLM-on-K8s guide
  • Build: vLLM on minikube + HPA
  • Mock: "design YouTube video upload"
WEEK 06 · SCALE
CI/CD, observability, DNS/CDN, events
  • Martin Fowler CD4ML
  • Observability Engineering (skim)
  • Kafka fundamentals
  • Mock: "design Twitter/X timeline"
WEEK 07 · AI
LLM serving + RAG
  • Chip Huyen — AI Engineering ch. 4–8
  • vLLM paper + docs
  • Build: RAG over your own docs
  • Mock: "design a doc-QA chatbot"
WEEK 08 · AI
Agents + end-to-end mocks
  • Lilian Weng agents post
  • Anthropic building-effective-agents
  • 5× 45-min mock interviews
  • Write your own Lookover design doc

30 practice problems

Drill these over weeks 1–8: the classic warm-ups in weeks 1–4, the AI-focused designs in weeks 5–8. Pick one, give yourself 45 minutes, whiteboard it, talk it out loud (record yourself if possible), then compare against a reference answer. The AI-specific ones marked ⌃ are the highest-yield for the roles you're targeting.

Classic HLD warm-ups (weeks 1–4)

  1. Design a URL shortener (bit.ly). Handle 100B URLs.
  2. Design a distributed rate limiter using Redis. Support sliding window & token bucket.
  3. Design a news feed system (Twitter/X style).
  4. Design a payment processor with strict idempotency.
  5. Design a typeahead/autocomplete service.
  6. Design a notification system (email + push + SMS, retries, dedup).
  7. Design a metrics/monitoring system like Datadog at small scale.
  8. Design a distributed cache (write-through, invalidation, consistency).
  9. Design a job scheduler like cron-at-scale.
  10. Design a chat system with read receipts & presence.

AI-focused designs (weeks 5–8) ⌃ HIGH YIELD

  1. Design an LLM inference serving layer for an open-source 70B model. Target 1000 concurrent users, p95 TTFT < 1s.
  2. Design a RAG system over 10M enterprise documents. Handle daily updates.
  3. Design a semantic cache in front of the OpenAI API. Target 30% cost reduction.
  4. Design an agentic workflow orchestrator (Temporal-style) for LLM agents with tool use.
  5. Design an LLM evaluation & observability platform (i.e. Lookover). Cover tracing, evals, alerts.
  6. Design a multi-tenant vector search service with per-tenant isolation and quotas.
  7. Design a fine-tuning pipeline: data prep → training job → eval → deploy.
  8. Design a prompt management system with versioning, A/B testing, and rollback.
  9. Design an image-generation service (Midjourney-lite). Handle queues, priority tiers, NSFW filtering.
  10. Design an AI-powered code review bot that scales to 1000 repos.
  11. Design a real-time voice assistant with sub-500ms round-trip latency.
  12. Design a GPU cluster autoscaler that balances cost vs latency for fluctuating traffic.
  13. Design a guardrails system that sits between users & an LLM. PII redaction, jailbreak detection, output filtering.
  14. Design an LLM gateway/proxy with rate limiting, key rotation, and cost tracking across providers.
  15. Design an EU AI Act compliance evidence-collection pipeline (hi there, Lookover).
  16. Design an embeddings refresh system — detect drift, re-embed, migrate indexes without downtime.
  17. Design a distributed training job scheduler for multi-node LLM fine-tunes.
  18. Design an LLM-as-a-service API platform (OpenRouter/Fireworks style).
  19. Design an AI workflow testing framework — deterministic replay of LLM interactions.
  20. Design a model registry with lineage, evals, and staged promotion (staging → canary → prod).
⌃ How to structure every answer
(1) Clarify: scale, latency, cost budget, deterministic vs probabilistic. (2) Functional & non-functional requirements. (3) Back-of-envelope math. (4) High-level diagram: data plane + control plane. (5) Deep-dive one or two components. (6) Discuss trade-offs and failure modes. (7) How you'd monitor & evaluate it. Always end with "what would you want me to go deeper on?"

Three rules that'll separate you

1 — Always quantify

  • "~1000 QPS" not "high traffic"
  • "p95 latency < 200ms" not "fast"
  • "$2 per 1M input tokens, $6 per 1M output tokens" not "expensive" (illustrative figures; providers quote prices per million tokens)
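  • "KV cache per sequence ≈ 2 × num_layers × hidden_size × seq_len × bytes_per_element" (with grouped-query attention, replace hidden_size by num_kv_heads × head_dim) → know approximate memory for 7B / 70B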
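The KV-cache estimate is easy to turn into numbers. The sketch below uses the published Llama 2 shapes as examples (7B: 32 layers, 32 KV heads of dimension 128; 70B: 80 layers, 8 KV heads of dimension 128 thanks to grouped-query attention) and assumes fp16 storage at a 4096-token context; other models will differ.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3

# Llama-2-7B uses full multi-head attention: 32 KV heads x 128 dims = hidden size 4096.
llama2_7b_gib = kv_cache_bytes(32, 32, 128, 4096) / GIB
# Llama-2-70B uses grouped-query attention: only 8 KV heads despite hidden size 8192.
llama2_70b_gib = kv_cache_bytes(80, 8, 128, 4096) / GIB
```

Under these assumptions the 7B model needs 2 GiB of KV cache per 4096-token sequence and the 70B model 1.25 GiB, which shows why grouped-query attention matters more than parameter count for concurrency.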

2 — Name trade-offs explicitly

  • "Using semantic cache — trades freshness for cost & latency"
  • "Prefill/decode disaggregation — more complex but better throughput"
  • "KEDA scale-to-zero — saves cost but adds cold start"
  • "Hybrid search — better recall but higher latency and cost"
3 — Ground it in your actual work

You're building Lookover and an AI compliance agency. When the interviewer asks a design question, pull from real experience: "At Lookover, I handle this with…", "For the DodoPayments orchestrator, we used idempotency keys because…". Specificity > generic textbook answers, every single time. Your production experience is your edge over textbook-primed candidates.