Preparation Dossier / System Design Interviews

High-Level Design.
For AI-native engineers.

A structured 8-week prep plan covering the 13 core HLD topics you listed — each one reframed through the lens of modern AI infrastructure: LLM inference, RAG pipelines, agentic systems, GPU autoscaling, and observability for non-deterministic workloads. Every topic has curated resources (books, blogs, YouTube, papers, docs), an AI-specific angle, and interview questions to self-test against.

13+1 core topics · 8-week study plan · 80+ curated resources · 40h mock design time

Index / Jump To

Thirteen topics, an add-on, and a schedule

01 Language & Concurrency ⌃ AI
02 CI/CD for AI Systems ⌃ AI
03 Observability & Dashboards ⌃ AI
04 Databases: Partition, Shard, Tune ⌃ AI
05 Kubernetes & GPU Autoscaling ⌃ AI
06 Load Balancers ⌃ AI
07 Rate Limiting: Leaky Bucket et al. ⌃ AI
08 Idempotency ⌃ AI
09 Caching & Invalidation ⌃ AI
10 Distributed Systems ⌃ AI
11 Event-Driven: Kafka, RabbitMQ ⌃ AI
12 DNS & CDN ⌃ AI
13 RPC & gRPC ⌃ AI
14 AI Add-on: RAG, Serving, Agents ⌃ NEW
15 8-Week Schedule ⌃ PLAN
16 Practice Problems ⌃ MOCK

Overview / 8-Week Structure

The plan, in four phases

Treat this like a curriculum. Don't jump ahead — each phase assumes comfort with the previous. Phase 1 gives you the vocabulary; Phase 2 gives you the plumbing; Phase 3 makes it scale; Phase 4 is where you combine everything into full AI systems you can defend in an interview.

PHASE 01 · Foundations · Weeks 1–2

Language & concurrency primitives, core distributed-systems vocabulary, RPC/gRPC basics. The stuff every other topic depends on.

PHASE 02 · Plumbing · Weeks 3–4

Databases (partition, shard, tune), caching, load balancers, rate limiting, idempotency. The stateful machinery.

PHASE 03 · Scale & Ops · Weeks 5–6

Kubernetes & GPU autoscaling, CI/CD, observability, DNS/CDN, event-driven architectures with Kafka/RabbitMQ.

PHASE 04 · AI Systems · Weeks 7–8

LLM serving (vLLM, SGLang), RAG pipelines, agent infra, semantic caching, evals. Mock interviews end-to-end.

⌃ How to use this doc: For each topic, read the summary, study the resources, write down the AI-specific twist, then answer the practice questions out loud as if in an interview. The goal isn't memorization; it's being able to reason about trade-offs when the interviewer says "but what if the traffic spikes 10x?" or "what breaks when you move from one GPU to a fleet?"

01 Language skills & concurrency

FOUNDATION AI-RELEVANT

Concurrency is the first filter in HLD interviews. Expect questions on goroutines vs threads, async/await, GIL behaviour, channels, mutexes, context cancellation, and how your language's concurrency model handles backpressure. For AI systems specifically: async I/O matters enormously because LLM calls are slow (seconds) and parallel tool calls are common.

Core concepts to nail

Primitives

Processes vs threads vs coroutines
Shared memory vs message passing
Mutex, semaphore, RWLock, atomic ops
Channels (Go), Futures (Rust), Promises (JS)
Actor model (Erlang, Akka)
Context cancellation, deadlines, timeouts

Traps

Deadlock, livelock, starvation
Race conditions, memory visibility, reordering
Thread pool exhaustion under slow I/O
Python GIL — when it hurts, when it doesn't
Async-over-sync contamination ("coloured functions")
Goroutine leaks from un-cancelled contexts

⌃ AI Angle Your LangGraph agents fan out to multiple tools in parallel. You need asyncio.gather with per-task timeouts, bounded semaphores to cap concurrent LLM calls (so you don't blow your rate limit), and proper cancellation so a slow tool doesn't hold up the orchestrator. In Go, this is errgroup + context.WithTimeout. Know this pattern cold — it comes up in every agent system design.
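
The fan-out pattern described above can be sketched in a few lines of asyncio. This is a minimal illustration, not a LangGraph API: `fan_out`, `ToolCall`, and the default limits are hypothetical names and values chosen for the example.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple

ToolCall = Callable[[], Awaitable[str]]  # hypothetical shape of one tool/LLM call


async def fan_out(
    calls: Dict[str, ToolCall],
    max_concurrency: int = 4,
    per_call_timeout: float = 2.0,
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Run every call concurrently and return (results, errors) keyed by name."""
    semaphore = asyncio.Semaphore(max_concurrency)  # caps in-flight calls

    async def run_one(
        name: str, call: ToolCall
    ) -> Tuple[str, Optional[str], Optional[Exception]]:
        async with semaphore:
            try:
                # wait_for cancels the underlying call when the timeout fires,
                # so one slow tool cannot hold up the orchestrator.
                value = await asyncio.wait_for(call(), timeout=per_call_timeout)
                return name, value, None
            except Exception as exc:  # includes asyncio.TimeoutError
                return name, None, exc

    outcomes = await asyncio.gather(*(run_one(n, c) for n, c in calls.items()))
    results = {name: value for name, value, err in outcomes if err is None}
    errors = {name: err for name, value, err in outcomes if err is not None}
    return results, errors
```

The timeout clock starts only after a call has acquired the semaphore, so time spent queueing does not count against the per-call budget. Errors are aggregated rather than raised so the orchestrator can decide whether partial results are good enough.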

Resources

Book · Concurrency in Go · Katherine Cox-Buday · O'Reilly
Book · Designing Data-Intensive Applications, Ch. 7 & 8 · Martin Kleppmann
Video · Rob Pike · Concurrency Is Not Parallelism · YouTube
Blog · What Color Is Your Function? · Bob Nystrom (on async)
Docs · Python asyncio patterns & pitfalls
Hands-on · Build: concurrent LLM fan-out with timeouts + semaphore + error aggregation

02 CI/CD: depth for AI deployments

FOUNDATION AI-RELEVANT

Classic CI/CD is well-trodden. The AI twist is: what do you test when the output is probabilistic? Canary by eval score, shadow deployments, prompt versioning, model rollbacks, and feature-flagged prompts are all fair game.

Standard pillars

CI concepts

Build, test, lint, SAST, SCA stages
Matrix builds, caching, artefact stores
Trunk-based vs GitFlow
PR-preview environments

CD strategies

Blue/green, canary, rolling, recreate
Feature flags (LaunchDarkly, Unleash)
GitOps with ArgoCD / Flux
Progressive delivery (Flagger, Argo Rollouts)

⌃ AI Angle — what's different A prompt change is a deploy. A model version bump is a deploy. Fine-tune an adapter — deploy. You need: (1) an eval suite that runs in CI (golden datasets, LLM-as-judge, pairwise prefs) and gates promotion; (2) shadow traffic to compare new prompt/model against prod without affecting users; (3) canary by percentage with auto-rollback on quality regression, not just on 5xx rate. This is the modern MLOps/LLMOps delivery loop.
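
A quality gate of the kind described in (1) and (3) can be as small as a comparison of mean eval scores. The sketch below assumes scores in the range 0 to 1 from whatever eval suite you run; `gate_promotion`, the 0.02 tolerance, and the minimum sample count are illustrative choices, not a standard.

```python
from statistics import mean
from typing import Sequence, Tuple


def gate_promotion(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_regression: float = 0.02,  # assumed tolerance on the mean score
    min_samples: int = 20,         # assumed minimum eval-set size
) -> Tuple[bool, str]:
    """Decide whether a new prompt/model may be promoted.

    Returns (passed, reason). The candidate fails when it was evaluated on
    too few samples or when its mean score drops by more than max_regression.
    """
    if len(candidate_scores) < min_samples:
        return False, "not enough eval samples"
    delta = mean(candidate_scores) - mean(baseline_scores)
    if delta < -max_regression:
        return False, f"quality regression: mean score changed by {delta:+.3f}"
    return True, f"ok: mean score changed by {delta:+.3f}"
```

In CI the same function can fail the job when `passed` is false; in a canary it can drive auto-rollback. A production gate would usually add a significance test, since small eval sets make mean differences noisy.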

Resources

Book · Continuous Delivery · Humble & Farley (the canonical text)
Blog · Google SRE Workbook · chapter on canarying releases
Blog · Martin Fowler · Continuous Delivery for ML (CD4ML)
Blog · Chip Huyen · CI/CD for machine learning
Video · ArgoCD GitOps in 100 seconds & full walkthrough · TechWorld with Nana · YouTube
Docs · Argo Rollouts · analysis-based promotion
Hands-on · Build: GitHub Actions pipeline that runs LLM evals on PR & blocks merge on regression

03 Observability: setup, query, visualise

FOUNDATION AI-RELEVANT HIGH-YIELD

The three pillars: metrics, logs, traces. Interviewers want to hear you say Prometheus + Grafana for metrics, OpenTelemetry for traces, Loki / ELK for logs. For AI: you also need per-request token counts, TTFT, p95 generation latency, KV-cache utilisation, and quality metrics (hallucination rate, groundedness). Lookover's whole thesis sits here.

Core stack to know cold

Pillar	Tool	What to practise
Metrics	Prometheus + Grafana	Write PromQL: rate(), histogram_quantile, recording rules, alerts
Traces	OpenTelemetry + Jaeger/Tempo	Propagate trace context across async boundaries, span attributes
Logs	Loki or ELK	Structured JSON logs, correlation IDs, log-to-trace linking
Profiling	Pyroscope / pprof	Continuous profiling in production
LLM-specific	Langfuse, LangSmith, Helicone	Prompt/response tracing, cost, eval runs

⌃ AI Angle — the metrics that matter For LLM inference: TTFT (time-to-first-token), tokens/sec, queue duration, KV cache utilisation, prompt tokens, completion tokens, $/request. vLLM and TGI expose most of these as Prometheus metrics natively. Your Grafana dashboard should answer: "Is latency degrading because the queue is growing, or because prompts are getting longer?"

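# Example PromQL for LLM serving
# (vLLM prefixes its metric names with "vllm:" and labels them with model_name)
# p95 TTFT by model
histogram_quantile(0.95,
  sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
)

# GPU memory pressure, % of framebuffer in use
# (FB_USED and FB_FREE are in dcgm-exporter's default set; FB_TOTAL may need enabling)
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100

# Queue depth (scale trigger)
sum by (model_name) (vllm:num_requests_waiting)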
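
It helps to know what histogram_quantile actually computes, because interviewers often ask why a p95 "snaps" to bucket edges. The sketch below mirrors the usual description of the algorithm for classic histograms (find the bucket containing the target rank, then interpolate linearly inside it); the bucket values in the usage note are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the +Inf bucket, as in Prometheus `le` buckets.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since there is nothing to interpolate towards.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with buckets `[(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]` the p95 lands at the top of the 0.5 s bucket. The estimate is only as fine-grained as the bucket layout, so put bucket boundaries near your SLO threshold.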

Resources

Book · Observability Engineering · Majors, Fong-Jones, Miranda (Honeycomb) · O'Reilly
Blog · Zerodha · Monitoring stack with Prometheus, Grafana, VictoriaMetrics
Blog · Monitor LLM inference with Prometheus & Grafana (vLLM, TGI, llama.cpp)
Video · Prometheus + Grafana crash course · TechWorld with Nana · YouTube
Docs · OpenTelemetry · concepts & semantic conventions for GenAI
Docs · Langfuse · open-source LLM observability
Hands-on · Build: vLLM + Prometheus + Grafana locally; design a dashboard for one model

04 Databases: partitioning, sharding, tuning

FOUNDATION AI-RELEVANT HIGH-YIELD

This is a multi-topic, high-yield area. You need to distinguish partitioning (logical split within one DB) from sharding (horizontal split across nodes), speak fluently about tuning, and for AI systems specifically — vector databases, hybrid search, and why pgvector-on-Postgres vs a dedicated vector DB matters.

4.1 Partitioning

Strategies

Range (date-based — most common)
List (discrete values: region, tenant)
Hash (uniform distribution)
Composite (range + hash)

Wins

Partition pruning — query only relevant chunks
Parallel query plans
Drop old partitions for archival (fast!)
Index size stays manageable

4.2 Sharding

Key decisions: shard key (avoid hotspots), resharding strategy (consistent hashing vs range splits), cross-shard queries (scatter-gather or avoid). Know Vitess (MySQL), Citus (Postgres), MongoDB native sharding, and how Discord reshards.
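
Consistent hashing is the resharding strategy most worth being able to sketch on a whiteboard. The version below is a minimal illustration with virtual nodes; `HashRing`, the shard names, and the choice of MD5 as the placement hash are assumptions for the example, not how any particular database implements it.

```python
import bisect
import hashlib
from typing import Iterable, List, Set, Tuple


def _hash(key: str) -> int:
    # MD5 is used only as a stable, well-spread placement hash (not for security).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes to even out load."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._nodes: Set[str] = set(nodes)
        self._rebuild()

    def _rebuild(self) -> None:
        ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in self._nodes
            for i in range(self._vnodes)
        )
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    def add(self, node: str) -> None:
        self._nodes.add(node)
        self._rebuild()

    def remove(self, node: str) -> None:
        self._nodes.discard(node)
        self._rebuild()

    def get(self, key: str) -> str:
        """Return the node owning key: the first ring entry clockwise of its hash."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property to call out in an interview: adding a shard only moves keys onto the new shard (roughly 1/N of them), whereas `hash(key) % N` reshuffles almost everything.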

4.3 Tuning — the Zerodha playbook

You asked specifically for the Zerodha Postgres blog. Here it is, plus more.

⌃ AI Angle — vector & hybrid search RAG systems need vector search. Know the trade-offs: pgvector (Postgres extension — keeps metadata JOINs easy) vs dedicated engines (Pinecone, Weaviate, Qdrant, Milvus). Understand IVFFlat vs HNSW indexes. Understand hybrid search (BM25 + vector, fused with Reciprocal Rank Fusion). For most startup-scale RAG, pgvector on Postgres is the right answer — and that's a strong, defensible interview take.
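
Reciprocal Rank Fusion is simple enough to write from memory: each document scores the sum of 1/(k + rank) over every ranking it appears in. The sketch below uses k = 60, the constant commonly quoted for RRF; the document IDs in the usage note are made up.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each list is ordered best-first. Ranks start at 1. Ties are broken by
    document ID so the output is deterministic.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

With a BM25 ranking `["a", "b", "c"]` and a vector ranking `["b", "d", "a"]`, the fused order is `["b", "a", "d", "c"]`: documents that both retrievers agree on rise to the top. RRF only needs ranks, so it sidesteps the problem of BM25 and cosine scores living on different scales.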

Resources

Blog · Zerodha · Scaling with common sense · Kailash Nadh
Blog · Zerodha · Working with PostgreSQL (the definitive tuning post)
Blog · Zerodha · 7M Postgres tables reporting hack
Video · Kailash Nadh · Scaling 7M+ Postgres Tables (talk) · YouTube
Book · Designing Data-Intensive Applications, Ch. 5 & 6 · Martin Kleppmann
Blog · Use the Index, Luke · the practical SQL index guide
Blog · Discord · How we reshard trillions of messages
Docs · pgvector · HNSW & IVFFlat index docs · GitHub
Paper · HNSW: Efficient & robust approximate nearest neighbour search · Malkov & Yashunin · arXiv
Hands-on · Build: partition a time-series table in pg, run EXPLAIN ANALYZE before/after

05 Kubernetes: deploy, scale, GPUs

FOUNDATION AI-RELEVANT HIGH-YIELD

Classic K8s topics: Deployments, Services, Ingress, ConfigMaps, probes, HPA. For AI: GPU scheduling, MIG (multi-instance GPU), KEDA for scale-to-zero, and DCGM metrics. This is the hottest intersection in interviews right now — GPU-aware autoscaling is a real skill gap in the market.

5.1 Deployment fundamentals

Must-know

Pod / Deployment / StatefulSet / DaemonSet
Service types: ClusterIP, NodePort, LoadBalancer
Ingress + ingress controllers (nginx, traefik)
Liveness, readiness, startup probes
Resource requests vs limits, QoS classes
RollingUpdate strategy, maxSurge, maxUnavailable

Production concerns

Pod Disruption Budgets (PDBs)
PodAntiAffinity for HA
NetworkPolicies for zero-trust
ServiceAccounts + RBAC
Secrets (sealed-secrets or external-secrets)
Graceful shutdown + preStop hooks

5.2 Minikube for local dev

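For practice, minikube or kind is sufficient. If you want to touch GPUs locally, use kind with the NVIDIA device plugin (this needs the NVIDIA container toolkit configured on the host) or just run vLLM in Docker with --gpus all. For production-like labs, GKE, EKS and AKS all offer GPU node pools, but none of them is free: budget for hourly GPU pricing or lean on trial credits, and tear the cluster down after each session.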

5.3 Autoscaling — the deep cut

Out of the box, HPA scales on CPU/memory. For LLM serving that signal is close to useless: the pod is GPU-bound and queue-bound. You need custom metrics, either HPA fed by a Prometheus adapter or KEDA with Prometheus-based triggers.

⌃ AI Angle: scale on queue depth, not CPU. The 2026 interview answer for "how do you scale LLM inference?" is: HPA on custom metrics (vllm:num_requests_waiting, TTFT, GPU utilisation via DCGM exporter), or KEDA when you also want to scale to zero during idle windows. Combine with MIG on A100/H100 for multi-tenant isolation. For cold starts, pre-pull images and cache model weights on a PVC. Keep rough numbers in your head (they vary with storage and GPU): 7B model cold start ≈ 30–60s with cached weights; 70B ≈ 2–5 min.

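# KEDA ScaledObject: scale vLLM between 0 and 8 replicas on queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 0         # scale to zero when idle
  maxReplicaCount: 8
  cooldownPeriod: 300        # seconds of inactivity before scaling back to zero
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_queue_depth   # label only; deprecated in recent KEDA releases
      # vLLM exports this gauge with a "vllm:" prefix. At zero replicas no pod
      # reports it, so waking from zero needs a signal measured upstream
      # (for example, pending requests at the gateway).
      query: sum(vllm:num_requests_waiting)
      threshold: "5"         # target waiting requests per replica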
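
To reason about what a threshold of 5 means, it helps to do the replica arithmetic by hand. The sketch below is a simplification of how an average-value target is usually described (desired replicas = ceil(total metric / target per replica), clamped to the min and max); it deliberately ignores HPA tolerance, stabilisation windows, and the cooldown period, and `desired_replicas` is a hypothetical helper, not a KEDA API.

```python
import math


def desired_replicas(
    queue_depth: float,
    threshold: float = 5,
    min_replicas: int = 0,
    max_replicas: int = 8,
) -> int:
    """Approximate replica count for a queue-depth target per replica."""
    if queue_depth <= 0:
        return min_replicas  # idle: fall back to the floor (zero here)
    wanted = math.ceil(queue_depth / threshold)
    return max(min_replicas, min(max_replicas, wanted))
```

With the values above, 6 waiting requests ask for 2 replicas, 23 ask for 5, and anything past 40 is capped at 8. That cap is also your cost guard, so in an interview state it alongside the GPU price per replica.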

Resources

Book · Kubernetes in Action · Marko Lukša (the definitive book) · Manning
Video · Kubernetes full course · TechWorld with Nana (free, 4+ hours) · YouTube
Blog · Deploying LLMs on Kubernetes: vLLM, Ray Serve & GPU scheduling (2026)
Blog · Autoscaling K8s GPU workloads · a complete production guide · Medium
Blog · Auto-scaling GPU inference pods with KEDA + cost guards
Docs · KEDA · scalers catalogue (Prometheus, Kafka, HTTP, RabbitMQ…)
Docs · NVIDIA DCGM exporter · GPU metrics for Prometheus · GitHub
Hands-on · Build: deploy vLLM on minikube/kind with HPA on a custom metric

06 Load balancers

FOUNDATION AI-RELEVANT

L4 vs L7. Algorithms (round-robin, least-conns, weighted, consistent hashing, EWMA). Health checks. Session affinity. For AI: least-request balancing is rarely right for LLM servers — you need KV-cache-aware routing because directing a conversation continuation to a worker that already has the prefix cached is an order-of-magnitude latency win.

Core concepts

Types

L4 (TCP/UDP) — HAProxy, AWS NLB
L7 (HTTP) — nginx, Envoy, Traefik, AWS ALB
DSR (direct server return)
GSLB — global/geo-based (covered in DNS)
Service mesh sidecar LB (Envoy via Istio/Linkerd)

Algorithms

Round-robin, weighted round-robin
Least connections, least response time
IP hash / consistent hashing (sticky)
Power-of-two-choices (P2C)
EWMA (exponentially-weighted moving average)

⌃ AI Angle — prefix-aware routing Modern LLM routers (vLLM Production Stack, llm-d, NVIDIA Dynamo) implement KV-cache-aware routing: hash the prompt prefix and prefer a worker whose cache already holds it. Combined with prefill/decode disaggregation (some workers do the one-shot prefill, others do token-by-token decode), this is the cutting edge of LLM load balancing. If you can explain this in an interview, you'll stand out.
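
A toy version of prefix-aware routing fits in a dozen lines. The sketch below is an approximation: real routers track which KV blocks each worker actually holds, whereas this one simply maps equal prompt prefixes to the same worker with rendezvous hashing and falls back to the least-loaded worker when the preferred one is saturated. `pick_worker`, the 256-character prefix, and the in-flight cap are illustrative assumptions.

```python
import hashlib
from typing import Dict


def _score(worker: str, prefix_key: str) -> int:
    # Rendezvous (highest-random-weight) hashing: every worker gets a
    # pseudo-random score per prefix, and the highest score wins.
    return int(hashlib.sha256(f"{worker}|{prefix_key}".encode()).hexdigest(), 16)


def pick_worker(
    prompt: str,
    in_flight: Dict[str, int],
    prefix_chars: int = 256,  # assumed stand-in for "first N tokens"
    max_in_flight: int = 8,   # assumed per-worker saturation point
) -> str:
    """Choose a worker for prompt given current in-flight counts per worker."""
    prefix_key = hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()
    preferred = max(in_flight, key=lambda w: _score(w, prefix_key))
    if in_flight[preferred] < max_in_flight:
        return preferred  # likely cache hit: same prefix, same worker
    # Preferred worker is saturated: trade the cache hit for lower queueing.
    return min(in_flight, key=lambda w: (in_flight[w], w))
```

Two requests that share a long system prompt land on the same worker, which is where the prefix is most likely still cached. The fallback matters: strict stickiness turns one popular prefix into a hotspot.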

Resources

Blog · Cloudflare · Load balancing at the edge (technical deep dive)
Blog · Netflix · Rethinking Netflix's Edge Load Balancing
Video · System Design: L4 vs L7 Load Balancers · ByteByteGo · YouTube
Docs · Envoy Proxy · HTTP load balancing configuration
Blog · The New Stack · Six frameworks for efficient LLM inferencing (covers routing)
Paper · The Power of Two Choices in Randomized Load Balancing · Mitzenmacher

07 Rate limiting: leaky & token bucket

FOUNDATION AI-RELEVANT

Know all five classic algorithms: fixed window, sliding window log, sliding window counter, token bucket, leaky bucket. Know where you apply them (client, edge, service, DB). For AI systems, rate-limit by tokens, not just requests; otherwise one 100k-token prompt can starve a thousand small ones.

Algorithm cheatsheet

Algorithm	Gist	Pros / Cons
Fixed window	Counter per time bucket	Simple; burst at bucket edges
Sliding window log	Store every timestamp	Accurate; high memory
Sliding window counter	Weighted average of adjacent windows	Good balance — most common
Token bucket	Tokens refill at rate `r`, burst to `b`	Allows controlled bursts
Leaky bucket	Queue drained at constant rate	Smooths output; may add latency

⌃ AI Angle — token-based limits OpenAI, Anthropic, Google all rate-limit on tokens per minute (TPM) in addition to requests per minute (RPM). Your proxy needs to pre-estimate token count from the prompt and reserve capacity. Design a fair-queueing scheme so a big prompt doesn't monopolise — weighted fair queueing with token-cost weights. Also: implement exponential backoff with jitter for 429s from upstream providers.
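
A token bucket that charges by estimated LLM tokens rather than by request looks like this. It is a single-process sketch with an injectable clock for testing; `estimate_tokens` uses a rough "about 4 characters per token" heuristic, which is an assumption and should be replaced by the provider's tokenizer in real use.

```python
import time
from typing import Callable


def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    """Rough pre-estimate of a request's token cost (assumed ~4 chars/token)."""
    return len(prompt) // 4 + max_output_tokens


class TokenBucket:
    """Token bucket: refills at rate_per_sec, allows bursts up to capacity."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def try_consume(self, cost: float) -> bool:
        """Reserve cost tokens if available; return False to signal a 429."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False
```

A proxy would call `bucket.try_consume(estimate_tokens(prompt, max_tokens))` before forwarding, then reconcile against the actual usage reported in the response. For a multi-instance proxy the same arithmetic has to run atomically in a shared store such as Redis.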

Resources

Blog · Stripe · Scaling your API with rate limiters
Blog · Figma · An alternative approach to rate limiting
Video · Rate Limiting Fundamentals · System Design Interview (Alex Xu/ByteByteGo) · YouTube
Blog · Cloudflare · How we built rate limiting capable of scaling to millions
Docs · OpenAI cookbook · how to handle rate limits
Hands-on · Build: token-bucket rate limiter in Redis that rate-limits by estimated LLM tokens

08 Idempotency

FOUNDATION AI-RELEVANT

Idempotency keys, dedup windows, retries with exponential backoff + jitter, at-least-once vs exactly-once semantics. Every payment system you've built at DodoPayments lives or dies by this. For AI: agents retry tool calls constantly, and a non-idempotent "send email" tool is a disaster waiting to happen.

Key patterns

Producer side

Generate idempotency-key on the client
Include in request header (e.g. Idempotency-Key)
Retry with same key on failure

Consumer side

Dedup store (Redis, DynamoDB) with TTL
Unique constraint at DB level as backstop
Outbox pattern for transactional publishing
Transactional inbox for consumers

⌃ AI Angle: tool-call idempotency. When an agent calls send_email(to=X, subject=Y, body=Z) and times out, did the email send? Design your tool interface so every call takes an idempotency-key derived deterministically from stable inputs: run ID, step index, tool name and canonicalised arguments. (A hash of the model's free-text reasoning is a poor key, because the text can differ on retry.) The orchestrator dedupes on this key. This is crucial for your LangGraph agents: the DodoPayments Refund Orchestrator cannot double-refund, ever.
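
One way to realise the key-plus-dedupe idea is shown below. It is a sketch under stated assumptions: the key is built from a run ID, step index, tool name, and arguments (hypothetical fields for this example), and an in-memory dict stands in for the Redis or database dedup store.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def idempotency_key(run_id: str, step: int, tool: str, args: Dict[str, Any]) -> str:
    """Derive a stable key: same logical call, same key, regardless of arg order."""
    payload = json.dumps(
        {"run": run_id, "step": step, "tool": tool, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()


class DedupingExecutor:
    """Runs each keyed tool call at most once and replays the stored result."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}  # stand-in for Redis/DB with TTL

    def call(self, key: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        if key in self._results:
            return self._results[key]  # retry: return the earlier outcome
        result = fn(**kwargs)
        self._results[key] = result
        return result
```

This sketch is not crash-safe: a crash after `fn` runs but before the result is stored would still allow a duplicate. That gap is why the key should also be passed to the downstream API (for example in an Idempotency-Key header) so the final system of record can dedupe too.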

Resources

Blog · Stripe · Designing robust and predictable APIs with idempotency
Blog · Brandur Leach · Implementing Stripe-like idempotency keys in Postgres
Blog · Microservices.io · Transactional outbox & inbox patterns
Video · Designing for failure: exactly-once semantics explained · Arjan Codes / ByteByteGo · YouTube

09 Caching & invalidation

FOUNDATION AI-RELEVANT HIGH-YIELD

Read-through, write-through, write-behind, cache-aside. TTL expiry vs LRU/LFU eviction. Stampede (thundering herd) prevention. For AI there are two massive wins: prompt caching (Anthropic & OpenAI both expose it) and semantic caching (embed the query, look up near-matches, return the cached response if cosine similarity > threshold).

9.1 Strategies

Pattern	Flow	Trade-off
Cache-aside (lazy)	App checks cache → miss → load from DB → populate cache	Simple; first request slow
Read-through	Cache loads from DB on miss transparently	Client doesn't know about DB
Write-through	Write to cache & DB synchronously	Slow writes; strong consistency
Write-behind (back)	Write to cache; flush async to DB	Fast writes; risk on cache loss
Refresh-ahead	Cache proactively refreshes before TTL	Hides latency; may over-fetch

9.2 Invalidation — the hard part

The two hardest things in CS are cache invalidation, naming things, and off-by-one errors. Strategies: TTL (lazy), event-driven invalidation (publish change events), versioned keys (bump version on write), write-through (trivially consistent but slow).

⌃ AI Angle — three caches that matter 1. Prompt caching — Anthropic/OpenAI cache repeated prompt prefixes, cutting cost & latency. Use it for system prompts + long context docs.
2. KV-cache reuse: at the serving layer (vLLM's automatic prefix caching, built on PagedAttention blocks), a prompt prefix that has already been processed doesn't need recomputation.
3. Semantic caching — embed the user query, check vector store for a near-match past response. Ship only if similarity > 0.95 and the cached answer is still fresh. Watch out: semantic cache poisoning is real.
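
The semantic-cache lookup in item 3 reduces to "nearest stored embedding, above a threshold, not expired". The sketch below uses a linear scan and plain lists so it stays self-contained; a real deployment would use a vector index (pgvector, Redis) and a real embedding model, and the 0.95 threshold and TTL are the illustrative values from the text, not tuned numbers.

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a cached response when a fresh, similar-enough query exists."""

    def __init__(
        self,
        threshold: float = 0.95,
        ttl_seconds: float = 3600,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.threshold = threshold
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries: List[Tuple[Sequence[float], str, float]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((embedding, response, self._clock()))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        now = self._clock()
        best, best_sim = None, self.threshold
        for stored, response, created in self._entries:
            if now - created > self.ttl:
                continue  # stale entries never count as hits
            sim = cosine(embedding, stored)
            if sim > best_sim:
                best, best_sim = response, sim
        return best
```

The threshold is the whole game: too low and users get confidently wrong answers to a different question, too high and the hit rate collapses. Keying entries per tenant is a cheap first defence against cache poisoning.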

Resources

Book · Designing Data-Intensive Applications, Ch. 3 · Martin Kleppmann
Blog · Facebook · Scaling Memcache at Facebook (classic paper-blog)
Docs · Redis · Client-side caching & invalidation
Docs · Anthropic · Prompt caching (official guide)
Blog · Semantic caching in LLM pipelines · Redis blog
Video · Cache patterns explained · ByteByteGo · YouTube
Hands-on · Build: semantic cache with Redis + pgvector, measure hit rate on real prompts

10 Distributed systems

FOUNDATION AI-RELEVANT HIGH-YIELD

The broadest topic. CAP, PACELC, consistency models, consensus (Raft, Paxos), leader election, replication, 2PC/3PC/sagas, vector clocks, CRDTs. This is the "speak the language fluently" topic — you won't be asked to implement Raft, but you must reason about what breaks when a network partitions during your RAG write.

Mental models to own

Foundational

CAP & PACELC theorem
Consistency: linearizable, sequential, causal, eventual
Consensus: Raft, Paxos, ZAB (roughly, when to use)
Leader election vs leaderless (Dynamo-style)
Quorum reads/writes (W + R > N)

Practical

Sagas (orchestration vs choreography)
Two-phase commit & why it's rare
Outbox pattern & change data capture (CDC)
Distributed tracing & clock skew (Lamport, vector clocks)
Partition tolerance strategies: retries, hedging, fallback

⌃ AI Angle — your agents are distributed systems A multi-step LangGraph agent running 4 tool calls across 3 external APIs is a distributed system. You'll be asked: what happens if tool call 3 of 4 succeeds but the agent crashes before committing state? Answer: durable execution (Temporal, Restate, AWS Step Functions) or your own checkpointing. For multi-agent systems, you have a consensus problem: which agent's answer wins? Know this vocabulary.
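
The "own checkpointing" option can be sketched as a loop that persists state after every step and skips completed steps on restart. This is a deliberately small illustration, not how Temporal or Restate work internally; `run_with_checkpoints`, the JSON file, and the step signature are assumptions for the example.

```python
import json
import os
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict[str, Any]], Any]]  # (name, fn(state) -> value)


def run_with_checkpoints(steps: List[Step], checkpoint_path: str) -> Dict[str, Any]:
    """Run steps in order, resuming from checkpoint_path if it exists."""
    state: Dict[str, Any] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for name, fn in steps:
        if name in state:
            continue  # finished in an earlier run: do not repeat side effects
        state[name] = fn(state)
        # Write to a temp file, then rename, so a crash mid-write cannot
        # leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state
```

If the process dies during step 3, a restart replays only step 3. The remaining hole is a crash after a step's side effect but before its checkpoint is written, which is exactly why the tools themselves still need to be idempotent.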

Resources

Book · Designing Data-Intensive Applications · Martin Kleppmann (the entire book)
Book · Understanding Distributed Systems · Roberto Vitillo
Video · MIT 6.824 Distributed Systems lectures · Robert Morris (free on YouTube)
Paper · In Search of an Understandable Consensus Algorithm (Raft)
Paper · Dynamo: Amazon's Highly Available Key-value Store
Blog · Jepsen · consistency analyses (the gold standard)
Docs · Temporal · durable execution for agents & workflows

11 Event-driven architectures: Kafka & RabbitMQ

FOUNDATION AI-RELEVANT

Kafka and RabbitMQ solve different problems. Kafka is a distributed log — durable, replayable, high throughput, good for streams and CDC. RabbitMQ is a message broker — flexible routing, lower throughput, good for task queues. For AI workloads: Kafka for ingestion & evaluation streams, RabbitMQ (or SQS) for inference task queues.

Core differences

Aspect	Kafka	RabbitMQ
Model	Distributed log (consumer pull)	Broker (push, routing)
Ordering	Per-partition	Per-queue (with caveats)
Throughput (rough, hardware-dependent)	100k–1M+ msg/s	10k–50k msg/s
Retention	Days/weeks/forever	Until consumed (classic queues)
Replay	Yes, native	Not on classic queues (dead-letter workaround; RabbitMQ Streams add replay)
Best fit	Event streaming, CDC, analytics, audit logs	Task queues, work distribution, RPC-ish

⌃ AI Angle — where each fits Kafka: stream all LLM request/response pairs for offline evaluation & fine-tuning dataset building. Use Kafka Streams or Flink for real-time drift detection on embeddings. RabbitMQ/SQS: async document ingestion for RAG (user uploads PDF → queue → worker chunks + embeds + stores), and long-running inference jobs (image gen, batch transcription). KEDA can scale both Kafka and RabbitMQ consumers natively.

Resources

Book · Kafka: The Definitive Guide · Shapira et al. · O'Reilly
Blog · Confluent · Kafka fundamentals & design patterns
Video · Apache Kafka in 6 minutes + deep dives · ByteByteGo · YouTube
Blog · RabbitMQ vs Kafka: when to use which · Jack Vanlightly
Docs · RabbitMQ · work queues tutorial
Blog · Uber · Real-time data infrastructure with Kafka
Hands-on · Build: RAG ingestion pipeline (upload → RabbitMQ → worker → pgvector)

12 DNS & CDN

FOUNDATION AI-RELEVANT

DNS: recursive resolvers, authoritative, TTL, DNS-based load balancing (GeoDNS), anycast. CDNs: edge vs origin, cache-control, origin shield, signed URLs, Workers/edge functions. For AI: edge inference is becoming real (Cloudflare Workers AI, Vercel AI SDK on edge). Know it.

Essentials

DNS

Record types (A, AAAA, CNAME, MX, TXT, SRV)
Recursive vs iterative resolution
TTL trade-offs (low = flexibility, high = resilience)
GeoDNS / latency-based routing (Route53, NS1)
Anycast for global presence

CDN

Cache-Control, Surrogate-Control, ETag
Origin shield, tiered caching
Stale-while-revalidate, stale-if-error
Signed URLs for private assets
Edge functions (Cloudflare Workers, CloudFront Functions)

⌃ AI Angle: edge inference & regional serving. Inference is moving closer to users: Cloudflare Workers AI runs models on its edge network, while Vercel AI Gateway and AWS Bedrock give you gateway and regional endpoints rather than CDN-edge inference. For global apps, route users to the closest model region via latency-based DNS. Cache embeddings of static content at the edge (they're small and cacheable). Cache model responses only under a cache key that separates users or tenants, for example Vary: Authorization where your CDN honours it. These are all high-signal details in a senior interview.

Resources

Blog · DNS explained in depth · Julia Evans (zines + blog)
Blog · Cloudflare Learning Center · DNS, CDN, anycast (free, excellent)
Blog · High Scalability · How CDNs work at scale
Docs · Cloudflare Workers AI · LLMs at the edge
Video · CDN Design · ByteByteGo · YouTube

13 RPC & gRPC

FOUNDATION AI-RELEVANT

REST vs gRPC vs GraphQL vs tRPC. Know when gRPC wins: internal service-to-service, streaming, strict typing, polyglot. Protobuf schema evolution. Interceptors, deadlines, metadata. For AI: server-sent events (SSE) and gRPC streams for token streaming; Model Context Protocol (MCP) is the new standard for tool calls.

REST vs gRPC — when to use what

Criterion	REST	gRPC
Transport	HTTP/1.1 or /2, JSON	HTTP/2, Protobuf (binary)
Typing	OpenAPI (optional)	Strict, via .proto
Streaming	SSE or WebSockets	Native bi-di streams
Browser	First class	Needs gRPC-Web proxy
Best fit	Public APIs, browser clients	Internal microservices, low-latency

⌃ AI Angle: streaming protocols for LLMs. Token-by-token streaming is non-negotiable for UX. Three options: SSE (simple, HTTP-compatible, browser-friendly; the OpenAI/Anthropic default), WebSockets (bi-directional, good for voice/interrupt), gRPC streaming (internal service mesh). For tool-calling, MCP (Model Context Protocol) is the open standard introduced by Anthropic. The spec is worth reading: it is essentially a structured RPC layer for LLM tools, built on JSON-RPC 2.0.
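
SSE is simple enough that the wire format is worth knowing by heart: each event is one or more `data:` lines followed by a blank line. The sketch below frames tokens as SSE events and parses them back. The `delta` field name is a hypothetical payload shape, and the `[DONE]` sentinel follows the convention used by OpenAI-style streaming APIs.

```python
import json
from typing import Iterable, Iterator


def sse_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as a server-sent event, then emit a terminator."""
    for token in tokens:
        # JSON-encoding the payload keeps newlines inside a token from
        # breaking the line-based SSE framing.
        yield f"data: {json.dumps({'delta': token})}\n\n"
    yield "data: [DONE]\n\n"


def parse_sse(frames: Iterable[str]) -> Iterator[str]:
    """Client side: recover tokens from SSE frames until the terminator."""
    for frame in frames:
        for line in frame.splitlines():
            if not line.startswith("data: "):
                continue  # skip blank separators and non-data fields
            payload = line[len("data: "):]
            if payload == "[DONE]":
                return
            yield json.loads(payload)["delta"]
```

On the server, a generator like `sse_stream` is what you would hand to a streaming HTTP response with content type `text/event-stream`. Because SSE rides on plain HTTP, it passes through most proxies and CDNs as long as response buffering is disabled.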

Resources

Book · gRPC: Up and Running · Kasun Indrasiri · O'Reilly
Video · gRPC vs REST: which one should you use? · ByteByteGo · YouTube
Docs · gRPC official · concepts, streaming, interceptors
Docs · Model Context Protocol (MCP) · Anthropic spec
Blog · Netflix · gRPC at Netflix (service mesh + observability)

14 AI add-on: serving, RAG, agents (must cover)

AI-NATIVE CRITICAL

You said "tweak the plan around AI" — this entire section is the tweak. GenAI/LLM system design is now a standalone interview category at OpenAI, Anthropic, Google, Meta, and every startup hiring AI engineers. Three sub-topics: LLM serving, RAG, and agents.

14.1 LLM Inference & Serving

Know the landscape: vLLM (open-source, PagedAttention, high throughput from continuous batching), TensorRT-LLM (NVIDIA, tuned for their hardware), Hugging Face TGI (ecosystem integration), SGLang (structured generation, RadixAttention prefix caching), llama.cpp / Ollama (local & edge). Key concepts: continuous batching, PagedAttention, speculative decoding, tensor parallelism, prefill/decode disaggregation.

14.2 RAG — Retrieval-Augmented Generation

End-to-end pipeline: parse → chunk → embed → store → retrieve → rerank → prompt → generate. For each stage, know 2–3 options and their trade-offs. Chunking strategy is where most RAG systems die — semantic chunking beats fixed-size for most document types but costs more. Always implement hybrid search (BM25 + vector) + reranking.
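
As a baseline for the chunk stage, fixed-size chunking with overlap is the thing semantic chunking has to beat. The sketch below splits on whitespace and counts words rather than model tokens, which is a simplifying assumption; the 200/40 defaults are illustrative.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of chunk_size words, each sharing overlap
    words with the previous chunk so sentences cut at a boundary still
    appear whole in one of the two chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Measure retrieval quality (for example precision@k on a small labelled set) with this baseline first; it tells you whether a more expensive chunker is buying anything for your documents.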

14.3 Agents & Tool-Use

An agent is an LLM with an execution loop, tools, memory, and guardrails. The LLM is ~20% of the system — the rest is infrastructure: orchestrator (LangGraph, custom), tool registry, sandbox, policy engine, observability. This is squarely in Lookover's wheelhouse — and your Claude dossier on the LangGraph automation stack reflects exactly the right mental model.

⌃ The must-know system designs
1. "Design an LLM chatbot with RAG over 10M docs."
2. "Design the inference serving layer for a popular open-source LLM."
3. "Design an AI agent that can take actions on behalf of users safely."
4. "Design an eval & observability platform for production LLM apps." (literally Lookover)
5. "Design a semantic cache for an LLM API proxy."

You should be able to whiteboard each of these, naming components, trade-offs, and failure modes.

Resources — essential

Book · Designing Machine Learning Systems · Chip Huyen · O'Reilly, 2022
Book · AI Engineering: Building Applications with Foundation Models · Chip Huyen · O'Reilly, 2024
Blog · Chip Huyen · huyenchip.com (entire blog)
Blog · Eugene Yan · Patterns for building LLM-based systems & products
Blog · Lilian Weng · LLM Powered Autonomous Agents (the canonical post)
Blog · Anthropic · Building effective agents
Blog · Generative AI System Design Interview Guide 2026 · PracHub
Blog · IGotAnOffer · GenAI system design interview (examples & framework)
Blog · Agentic AI System Design Interview Guide 2026 · Medium
Docs · vLLM documentation · PagedAttention, continuous batching, serving
Blog · Best LLM Inference Engines 2026 · vLLM, TensorRT-LLM, TGI, SGLang
Paper · Efficient Memory Management for LLM Serving with PagedAttention · Kwon et al. (vLLM paper) · arXiv
Paper · Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks · Lewis et al. · arXiv
Paper · ReAct: Synergizing Reasoning and Acting in Language Models · arXiv
Video · Andrej Karpathy · Let's build a GPT + Deep Dive into LLMs · YouTube
Video · Full-stack LLM Bootcamp · Charles Frye et al. (free) · FSDL
Hands-on · Build: end-to-end RAG on your Lookover compliance docs, measure precision@k

Week-by-week / Plan

The 8-week schedule

Aim for ~10 hours a week: 5 hours reading/watching, 3 hours hands-on building, 2 hours mock interviewing (out loud, whiteboard on paper). If 10h/week is too much, stretch to 12 weeks — don't compress below 8.

WEEK 01 · FOUNDATIONS

Language & concurrency

Read DDIA ch. 7–8
Watch Rob Pike concurrency
Build: concurrent LLM fan-out
Mock: "design a URL shortener"

WEEK 02 · FOUNDATIONS

Distributed systems + RPC

Start MIT 6.824 lectures 1–4
Read CAP, PACELC, Dynamo paper
gRPC tutorial + MCP spec
Mock: "design a chat system"

WEEK 03 · PLUMBING

Databases deep-dive

Zerodha Postgres blog trio
DDIA ch. 5–6 (replication, partitioning)
pgvector hands-on
Mock: "design news feed storage"

WEEK 04 · PLUMBING

Caching, LB, rate limits, idempotency

Stripe rate limits & idempotency posts
Facebook memcache paper
Build: semantic cache prototype
Mock: "design payment processor"

WEEK 05 · SCALE

Kubernetes & GPU autoscaling

Nana K8s crash course
PreMAI LLM-on-K8s guide
Build: vLLM on minikube + HPA
Mock: "design YouTube video upload"

WEEK 06 · SCALE

CI/CD, observability, DNS/CDN, events

Martin Fowler CD4ML
Observability Engineering (skim)
Kafka fundamentals
Mock: "design Twitter/X timeline"

WEEK 07 · AI

LLM serving + RAG

Chip Huyen — AI Engineering ch. 4–8
vLLM paper + docs
Build: RAG over your own docs
Mock: "design a doc-QA chatbot"

WEEK 08 · AI

Agents + end-to-end mocks

Lilian Weng agents post
Anthropic building-effective-agents
5× 45-min mock interviews
Write your own Lookover design doc

Practice / Self-test

30 practice problems

Drill these over weeks 1–8: the classic warm-ups in weeks 1–4, the AI-focused designs in weeks 5–8. Pick one, give yourself 45 minutes, whiteboard it, talk it out loud (record yourself if possible), then compare against a reference answer. The AI-specific ones marked ⌃ are the highest-yield for the roles you're targeting.

Classic HLD warm-ups (weeks 1–4)

Design a URL shortener (bit.ly). Handle 100B URLs.
Design a distributed rate limiter using Redis. Support sliding window & token bucket.
Design a news feed system (Twitter/X style).
Design a payment processor with strict idempotency.
Design a typeahead/autocomplete service.
Design a notification system (email + push + SMS, retries, dedup).
Design a metrics/monitoring system like Datadog at small scale.
Design a distributed cache (write-through, invalidation, consistency).
Design a job scheduler like cron-at-scale.
Design a chat system with read receipts & presence.

AI-focused designs (weeks 5–8) ⌃ HIGH YIELD

Design an LLM inference serving layer for an open-source 70B model. Target 1000 concurrent users, p95 TTFT < 1s.
Design a RAG system over 10M enterprise documents. Handle daily updates.
Design a semantic cache in front of the OpenAI API. Target 30% cost reduction.
Design an agentic workflow orchestrator (Temporal-style) for LLM agents with tool use.
Design an LLM evaluation & observability platform (e.g. Lookover). Cover tracing, evals, alerts.
Design a multi-tenant vector search service with per-tenant isolation and quotas.
Design a fine-tuning pipeline: data prep → training job → eval → deploy.
Design a prompt management system with versioning, A/B testing, and rollback.
Design an image-generation service (Midjourney-lite). Handle queues, priority tiers, NSFW filtering.
Design an AI-powered code review bot that scales to 1000 repos.
Design a real-time voice assistant with sub-500ms round-trip latency.
Design a GPU cluster autoscaler that balances cost vs latency for fluctuating traffic.
Design a guardrails system that sits between users & an LLM. PII redaction, jailbreak detection, output filtering.
Design an LLM gateway/proxy with rate limiting, key rotation, and cost tracking across providers.
Design an EU AI Act compliance evidence-collection pipeline (hi there, Lookover).
Design an embeddings refresh system — detect drift, re-embed, migrate indexes without downtime.
Design a distributed training job scheduler for multi-node LLM fine-tunes.
Design an LLM-as-a-service API platform (OpenRouter/Fireworks style).
Design an AI workflow testing framework — deterministic replay of LLM interactions.
Design a model registry with lineage, evals, and staged promotion (staging → canary → prod).

⌃ How to structure every answer

(1) Clarify: scale, latency, cost budget, deterministic vs probabilistic.
(2) Functional & non-functional requirements.
(3) Back-of-envelope math.
(4) High-level diagram: data plane + control plane.
(5) Deep-dive one or two components.
(6) Discuss trade-offs and failure modes.
(7) How you'd monitor & evaluate it.

Always end with "what would you want me to go deeper on?"
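Step (3), the back-of-envelope math, can be practiced as a few lines of arithmetic. The sketch below sizes an LLM serving fleet from concurrent streams. Every number in it is a placeholder assumption for illustration (20 tokens/s per stream, 2,500 tokens/s per replica, 70% usable headroom), not a measured figure for any model or GPU, and the function name `replicas_needed` is hypothetical.

```python
import math


def replicas_needed(concurrent_streams, tokens_per_sec_per_stream,
                    replica_tokens_per_sec, headroom=0.7):
    """Estimate replica count from aggregate decode throughput.

    All inputs are assumptions you state out loud in the interview;
    `headroom` is the fraction of peak throughput you plan to use.
    """
    demand = concurrent_streams * tokens_per_sec_per_stream
    usable = replica_tokens_per_sec * headroom
    return math.ceil(demand / usable)


if __name__ == "__main__":
    # Placeholder inputs: 1000 streams, 20 tok/s each, 2500 tok/s per replica.
    n = replicas_needed(1000, 20, 2500)
    print(n)  # 12 under these assumed inputs
```

The value of the exercise is less the final number than showing which inputs dominate: halving per-replica throughput or tightening headroom changes the fleet size far more than small changes in user count.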

Closing / Meta

Three rules that'll separate you

1 — Always quantify

"~1000 QPS" not "high traffic"
"p95 latency < 200ms" not "fast"
"$2/1k tokens input, $6/1k output" not "expensive"
"KV cache per sequence ≈ 2 × num_layers × hidden_size × seq_len × bytes_per_element" (smaller with GQA/MQA, where fewer KV heads are stored) → know approximate memory for 7B / 70B
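The KV cache rule of thumb above can be turned into a quick calculator. This is a sketch under stated assumptions: it writes `hidden_size` as `num_kv_heads × head_dim` so that grouped-query attention is covered, and the two example shapes are Llama-2-style configurations (32 layers with 32 KV heads, and 80 layers with 8 KV heads, both with head dimension 128 in fp16) chosen for illustration. Check the actual model config before quoting numbers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading 2 accounts for one K and one V tensor per layer.
    With standard multi-head attention, num_kv_heads * head_dim == hidden_size,
    which recovers the formula quoted in the text.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed 7B-class shape: 32 layers, 32 KV heads x 128 dims, fp16, 4096 tokens.
small = kv_cache_bytes(32, 32, 128, seq_len=4096)

# Assumed 70B-class shape with GQA: 80 layers, 8 KV heads x 128 dims, fp16.
large = kv_cache_bytes(80, 8, 128, seq_len=4096)

if __name__ == "__main__":
    print(f"7B-class:  {small / GIB:.2f} GiB per 4096-token sequence")
    print(f"70B-class: {large / GIB:.2f} GiB per 4096-token sequence")
```

Under these assumed shapes the larger model needs less cache per sequence than the smaller one, because grouped-query attention stores far fewer KV heads. That is a useful point to raise when discussing batch size and concurrency limits.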

2 — Name trade-offs explicitly

"Using semantic cache — trades freshness for cost & latency"
"Prefill/decode disaggregation — more complex but better throughput"
"KEDA scale-to-zero — saves cost but adds cold start"
"Hybrid search — better recall but higher latency and cost"
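The semantic cache trade-off above (freshness versus cost and latency) is easiest to see in code: the similarity threshold and the TTL are the two knobs that control it. The sketch below is a minimal in-memory illustration, assuming embeddings are produced elsewhere by some embedding model (not shown) and using a linear scan instead of a vector index. `SemanticCache`, `threshold`, and `ttl_seconds` are hypothetical names.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """In-memory semantic cache sketch: linear scan, TTL-based expiry."""

    def __init__(self, threshold=0.9, ttl_seconds=3600.0, clock=time.monotonic):
        self.threshold = threshold   # higher = fewer, safer hits
        self.ttl = ttl_seconds       # lower = fresher answers, fewer hits
        self.clock = clock
        self.entries = []            # (embedding, response, stored_at)

    def put(self, embedding, response):
        self.entries.append((embedding, response, self.clock()))

    def get(self, embedding):
        """Return the most similar unexpired response above threshold, else None."""
        now = self.clock()
        # Drop expired entries first: this is the freshness side of the trade-off.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        best, best_score = None, self.threshold
        for emb, resp, _ in self.entries:
            score = cosine(embedding, emb)
            if score >= best_score:
                best, best_score = resp, score
        return best
```

A miss returns `None`, at which point the caller would pay for the real LLM call and `put` the result. Loosening `threshold` or lengthening `ttl_seconds` raises the hit rate and saves cost, at the price of occasionally serving a stale or subtly mismatched answer.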

3 — Ground it in your actual work

You're building Lookover and an AI compliance agency. When the interviewer asks a design question, pull from real experience: "At Lookover, I handle this with…", "For the DodoPayments orchestrator, we used idempotency keys because…". Specificity > generic textbook answers, every single time. Your production experience is your edge over textbook-primed candidates.
