A structured 8-week prep plan covering the 13 core HLD topics you listed — each one reframed through the lens of modern AI infrastructure: LLM inference, RAG pipelines, agentic systems, GPU autoscaling, and observability for non-deterministic workloads. Every topic has curated resources (books, blogs, YouTube, papers, docs), an AI-specific angle, and interview questions to self-test against.
Treat this like a curriculum. Don't jump ahead — each phase assumes comfort with the previous. Phase 1 gives you the vocabulary; Phase 2 gives you the plumbing; Phase 3 makes it scale; Phase 4 is where you combine everything into full AI systems you can defend in an interview.
Concurrency is the first filter in HLD interviews. Expect questions on goroutines vs threads, async/await, GIL behaviour, channels, mutexes, context cancellation, and how your language's concurrency model handles backpressure. For AI systems specifically, async I/O matters enormously because LLM calls are slow (seconds) and parallel tool calls are common.
The pattern to know in Python: asyncio.gather with per-task timeouts, bounded semaphores to cap concurrent LLM calls (so you don't blow your rate limit), and proper cancellation so a slow tool doesn't hold up the orchestrator. In Go, the equivalent is errgroup + context.WithTimeout. Know this pattern cold; it comes up in nearly every agent system design.
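A minimal Python sketch of that fan-out pattern. The names `call_with_limits`, `fan_out`, and `fake_llm_call` are hypothetical stand-ins for illustration, not part of any library; a real client call would replace `fake_llm_call`.

```python
import asyncio


async def call_with_limits(coro_fn, sem, timeout_s):
    """Run one call under the shared semaphore with its own timeout."""
    async with sem:  # caps how many calls are in flight at once
        try:
            # The timer starts only after a slot is acquired.
            return await asyncio.wait_for(coro_fn(), timeout=timeout_s)
        except asyncio.TimeoutError:
            # wait_for cancels the inner task before raising, so a slow
            # tool is torn down instead of running on in the background.
            return None


async def fan_out(coro_fns, max_concurrency=4, timeout_s=1.0):
    """Run all calls concurrently; results come back in input order."""
    sem = asyncio.BoundedSemaphore(max_concurrency)
    tasks = [call_with_limits(fn, sem, timeout_s) for fn in coro_fns]
    return await asyncio.gather(*tasks)


async def fake_llm_call(delay_s, value):
    """Hypothetical stand-in for a slow LLM or tool call."""
    await asyncio.sleep(delay_s)
    return value


def demo():
    calls = [
        lambda: fake_llm_call(0.01, "fast"),
        lambda: fake_llm_call(5.0, "slow"),  # exceeds the timeout below
    ]
    return asyncio.run(fan_out(calls, max_concurrency=2, timeout_s=0.1))
```

Returning `None` on timeout is one design choice among several; an orchestrator might instead return a typed error so the agent can decide whether to retry.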
Classic CI/CD is well-trodden. The AI twist is: what do you test when the output is probabilistic? Canary by eval score, shadow deployments, prompt versioning, model rollbacks, and feature-flagged prompts are all fair game.
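One way to picture "canary by eval score" is as a promotion gate: the canary prompt or model is promoted only if its eval scores do not fall meaningfully below the baseline. The sketch below is an assumption-laden illustration; the function name, the `max_drop` tolerance, and the `min_samples` floor are all hypothetical, and a real gate would likely use a proper statistical test.

```python
from statistics import mean


def should_promote(baseline_scores, canary_scores, max_drop=0.02, min_samples=30):
    """Return True if the canary's mean eval score is within max_drop of baseline.

    Scores are assumed to be floats in [0, 1] from some offline or online eval.
    """
    if len(canary_scores) < min_samples:
        # Too little traffic to judge a probabilistic system; keep waiting.
        return False
    return mean(canary_scores) >= mean(baseline_scores) - max_drop
```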
The three pillars: metrics, logs, traces. Interviewers want to hear you say Prometheus + Grafana for metrics, OpenTelemetry for traces, Loki / ELK for logs. For AI: you also need per-request token counts, TTFT, p95 generation latency, KV-cache utilisation, and quality metrics (hallucination rate, groundedness). Lookover's whole thesis sits here.
| Pillar | Tool | What to practise |
|---|---|---|
| Metrics | Prometheus + Grafana | Write PromQL: rate(), histogram_quantile, recording rules, alerts |
| Traces | OpenTelemetry + Jaeger/Tempo | Propagate trace context across async boundaries, span attributes |
| Logs | Loki or ELK | Structured JSON logs, correlation IDs, log-to-trace linking |
| Profiling | Pyroscope / pprof | Continuous profiling in production |
| LLM-specific | Langfuse, LangSmith, Helicone | Prompt/response tracing, cost, eval runs |
TTFT (time-to-first-token), tokens/sec, queue duration, KV cache utilisation, prompt tokens, completion tokens, $/request. vLLM and TGI expose most of these as Prometheus metrics natively. Your Grafana dashboard should answer: "Is latency degrading because the queue is growing, or because prompts are getting longer?"
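# Example PromQL for LLM serving (vLLM metric names carry a "vllm:" prefix and a model_name label)
# p95 TTFT by model
histogram_quantile(0.95,
  sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
)
# GPU memory pressure in percent (FB_USED and FB_FREE are in dcgm-exporter's default metric set)
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100
# Queue depth (scale trigger)
sum by (model_name) (vllm:num_requests_waiting)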
This is a multi-topic, high-yield area. You need to distinguish partitioning (splitting tables within one database instance) from sharding (splitting data horizontally across nodes), speak fluently about database tuning, and, for AI systems specifically, cover vector databases, hybrid search, and when pgvector on Postgres is enough versus when a dedicated vector DB is warranted.
Key decisions: shard key (avoid hotspots), resharding strategy (consistent hashing vs range splits), cross-shard queries (scatter-gather or avoid). Know Vitess (MySQL), Citus (Postgres), MongoDB native sharding, and how Discord reshards.
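To make the consistent-hashing option concrete, here is a small sketch of a hash ring with virtual nodes. The class and parameter names are illustrative assumptions; production systems such as Vitess or Citus have their own resharding machinery.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring to smooth out hotspots.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property worth stating in an interview: when a node is added, keys either stay where they were or move to the new node, so only a fraction of the data is reshuffled.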
You asked specifically for the Zerodha Postgres blog. Here it is, plus more.
pgvector (Postgres extension — keeps metadata JOINs easy) vs dedicated engines (Pinecone, Weaviate, Qdrant, Milvus). Understand IVFFlat vs HNSW indexes. Understand hybrid search (BM25 + vector, fused with Reciprocal Rank Fusion). For most startup-scale RAG, pgvector on Postgres is the right answer — and that's a strong, defensible interview take.
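Reciprocal Rank Fusion is simple enough to write on a whiteboard. The sketch below fuses any number of ranked ID lists; the constant `k=60` is the value commonly used in practice, and the function name and tie-breaking rule are assumptions for illustration.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]      # keyword ranking
vector_hits = ["b", "d", "a"]    # embedding ranking
fused = rrf([bm25_hits, vector_hits])  # ["b", "a", "d", "c"]
```

RRF only needs ranks, not scores, which is why it works across BM25 and cosine similarity without any score normalisation.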
Classic K8s topics: Deployments, Services, Ingress, ConfigMaps, probes, HPA. For AI: GPU scheduling, MIG (multi-instance GPU), KEDA for scale-to-zero, and DCGM metrics. This is the hottest intersection in interviews right now — GPU-aware autoscaling is a real skill gap in the market.
For practice, minikube or kind is sufficient. To touch GPUs locally, use kind with the NVIDIA device plugin, or just run vLLM in Docker with --gpus all. For production-like labs, GKE, EKS and AKS all offer GPU node pools; none has a true free tier for GPUs, so use trial credits or spot/preemptible nodes and tear the cluster down after each session.
Out of the box, HPA scales on CPU/memory. For LLM serving, that's close to useless: your pod is GPU-bound and queue-bound, and CPU usage says little about either. You need custom metrics, most simply KEDA with Prometheus-based triggers.
Use HPA with custom metrics (queue depth via vllm:num_requests_waiting, TTFT, GPU utilisation via DCGM exporter) or KEDA to scale to zero during idle windows. Combine with MIG on A100/H100 for multi-tenant isolation. For cold starts, pre-pull images and cache model weights on a PVC. Know these rough numbers: 7B model cold start ≈ 30–60s with cached weights; 70B ≈ 2–5 min.
# KEDA ScaledObject: scale vLLM from 0 to 8 replicas on queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 0    # scale to zero when idle
  maxReplicaCount: 8
  cooldownPeriod: 300   # seconds to wait after the last active trigger before scaling to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_queue_depth
        query: sum(vllm:num_requests_waiting)   # vLLM metric names use a "vllm:" prefix
        threshold: "5"
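To reason about what the ScaledObject above does under load, it helps to do the arithmetic by hand. The sketch below is a simplified approximation of the replica calculation for an average-value trigger (queue depth divided by threshold, rounded up, then clamped); it deliberately ignores HPA tolerance, stabilisation windows, and KEDA's activation logic, and the function name is hypothetical.

```python
import math


def desired_replicas(queue_depth, threshold=5, min_replicas=0, max_replicas=8):
    """Approximate replica count for a queue-depth trigger.

    Assumes roughly: replicas = ceil(total_metric / threshold), clamped
    to [min_replicas, max_replicas].
    """
    if queue_depth <= 0:
        return min_replicas
    wanted = math.ceil(queue_depth / threshold)
    return max(min_replicas, min(max_replicas, wanted))
```

With a threshold of 5, a queue of 23 waiting requests asks for 5 replicas, and anything above 40 is capped at the maximum of 8.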
L4 vs L7. Algorithms (round-robin, least-conns, weighted, consistent hashing, EWMA). Health checks. Session affinity. For AI: least-request balancing is rarely right for LLM servers — you need KV-cache-aware routing because directing a conversation continuation to a worker that already has the prefix cached is an order-of-magnitude latency win.
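A toy sketch of the KV-cache-aware idea: requests that share a prompt prefix go back to the worker that last served that prefix, with a least-loaded fallback when that worker is saturated. Everything here (class name, the 256-character prefix, the `max_inflight` cap) is an assumption for illustration; real routers typically track cached token blocks rather than character prefixes.

```python
import hashlib


class PrefixAwareRouter:
    def __init__(self, workers, max_inflight=8):
        self.inflight = {w: 0 for w in workers}  # current load per worker
        self.owner = {}                          # prefix hash -> last worker
        self.max_inflight = max_inflight

    @staticmethod
    def _prefix_key(prompt, prefix_chars=256):
        return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()

    def route(self, prompt):
        key = self._prefix_key(prompt)
        worker = self.owner.get(key)
        if worker is None or self.inflight[worker] >= self.max_inflight:
            # No cached prefix, or the owner is overloaded: least-loaded wins.
            worker = min(self.inflight, key=self.inflight.get)
        self.owner[key] = worker
        self.inflight[worker] += 1
        return worker

    def done(self, worker):
        """Call when a request finishes so load tracking stays accurate."""
        self.inflight[worker] -= 1
```

The overload fallback is the important trade-off to articulate: cache affinity saves prefill time, but strict stickiness can create hot workers.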
Know all five classic algorithms: fixed window, sliding window log, sliding window counter, token bucket, leaky bucket. Know where you apply them (client, edge, service, DB). For AI systems, rate-limit by tokens, not just requests; otherwise one 100k-token prompt can starve a thousand small ones.
| Algorithm | Gist | Pros / Cons |
|---|---|---|
| Fixed window | Counter per time bucket | Simple; burst at bucket edges |
| Sliding window log | Store every timestamp | Accurate; high memory |
| Sliding window counter | Weighted average of adjacent windows | Good balance — most common |
| Token bucket | Tokens refill at rate r, burst to b | Allows controlled bursts |
| Leaky bucket | Queue drained at constant rate | Smooths output; may add latency |
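The token bucket row above maps naturally onto token-based LLM rate limiting: the cost of a request is its token count rather than 1. A minimal single-process sketch follows; the class name and the injectable `clock` parameter are assumptions made for testability, and a shared limiter would need an atomic store such as Redis.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s    # refill rate r
        self.capacity = burst     # maximum burst b
        self.level = burst        # start full
        self.clock = clock
        self.updated = clock()

    def allow(self, cost=1):
        """Admit the request if `cost` units are available (e.g. LLM tokens)."""
        now = self.clock()
        elapsed = now - self.updated
        self.level = min(self.capacity, self.level + elapsed * self.rate)
        self.updated = now
        if cost <= self.level:
            self.level -= cost
            return True
        return False
```

Usage: `TokenBucket(rate_per_s=1000, burst=20000).allow(cost=prompt_tokens)` charges a large prompt proportionally more than a small one.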
Idempotency keys, dedup windows, retries with exponential backoff + jitter, at-least-once vs exactly-once semantics. Every payment system you've built at DodoPayments lives or dies by this. For AI: agents retry tool calls constantly, and a non-idempotent "send email" tool is a disaster waiting to happen.
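If an agent calls send_email(to=X, subject=Y, body=Z) and the call times out, did the email send? You cannot tell from the timeout alone. Design your tool interface so every call takes an idempotency key (the same idea as an Idempotency-Key HTTP header) derived from the agent's thought-hash. The orchestrator dedupes on this key, so a retry returns the recorded result instead of repeating the side effect. This is crucial for your LangGraph agents: the DodoPayments Refund Orchestrator cannot double-refund, ever.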
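A sketch of orchestrator-side deduplication. Note one swapped-in assumption: instead of a thought-hash, the key here is derived from a hypothetical run ID, step number, tool name, and arguments, which keeps the example deterministic. The in-memory dict stands in for a durable store, and the class is not safe against concurrent callers or a crash between the tool call and the write.

```python
import hashlib
import json


class IdempotentToolRunner:
    def __init__(self):
        # Assumption: a real system would use a durable store with a TTL.
        self._results = {}

    @staticmethod
    def make_key(run_id, step, tool, args):
        """Derive a stable key; the same logical call always hashes the same."""
        payload = json.dumps([run_id, step, tool, args], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, key, fn, **args):
        if key in self._results:
            # Retry of a call that already ran: return the recorded result.
            return self._results[key]
        result = fn(**args)
        self._results[key] = result
        return result
```

For side effects on external systems, the key should also be forwarded to the downstream API when it supports one, so the guarantee holds end to end.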
Read-through, write-through, write-behind, cache-aside. TTL vs LRU vs LFU eviction. Stampede (thundering herd) prevention. For AI: two massive wins — prompt caching (Anthropic & OpenAI both expose it) and semantic caching (embed the query, look up near-matches, return cached response if cosine similarity > threshold).
| Pattern | Flow | Trade-off |
|---|---|---|
| Cache-aside (lazy) | App checks cache → miss → load from DB → populate cache | Simple; first request slow |
| Read-through | Cache loads from DB on miss transparently | Client doesn't know about DB |
| Write-through | Write to cache & DB synchronously | Slow writes; strong consistency |
| Write-behind (back) | Write to cache; flush async to DB | Fast writes; risk on cache loss |
| Refresh-ahead | Cache proactively refreshes before TTL | Hides latency; may over-fetch |
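Semantic caching, mentioned above as an AI-specific win, can be sketched in a few lines. This version does a linear scan over stored embeddings, which is an assumption that only holds for small caches; a real deployment would use a vector index, and the embeddings would come from an embedding model rather than hand-written lists. The 0.95 threshold is illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_response)

    def get(self, embedding):
        """Return the cached response of the nearest query, if close enough."""
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

The threshold is the whole game: too low and users get answers to questions they did not ask, too high and the hit rate collapses.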
The two hardest things in CS are cache invalidation, naming things, and off-by-one errors. Strategies: TTL (lazy), event-driven invalidation (publish change events), versioned keys (bump version on write), write-through (trivially consistent but slow).
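Of the strategies above, versioned keys are the least obvious, so a small sketch may help. The dict-backed `store` is a hypothetical stand-in for Redis or Memcached; in a real cache the orphaned old-version entries would simply age out via TTL.

```python
class VersionedCache:
    def __init__(self):
        self.versions = {}  # entity id -> current version number
        self.store = {}     # stand-in for the actual cache backend

    def _key(self, entity_id):
        return f"{entity_id}:v{self.versions.get(entity_id, 0)}"

    def get(self, entity_id):
        return self.store.get(self._key(entity_id))

    def put(self, entity_id, value):
        self.store[self._key(entity_id)] = value

    def invalidate(self, entity_id):
        # Bumping the version makes every old entry unreachable at once,
        # without having to find and delete it.
        self.versions[entity_id] = self.versions.get(entity_id, 0) + 1
```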
The broadest topic. CAP, PACELC, consistency models, consensus (Raft, Paxos), leader election, replication, 2PC/3PC/sagas, vector clocks, CRDTs. This is the "speak the language fluently" topic — you won't be asked to implement Raft, but you must reason about what breaks when a network partitions during your RAG write.
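Vector clocks are a good topic to make concrete, because the comparison rule is what lets a system tell "happened before" from "concurrent, needs conflict resolution". A minimal sketch, with clocks represented as plain dicts from node ID to counter (a representation chosen here for illustration):

```python
def compare(a, b):
    """Compare two vector clocks given as {node_id: counter} dicts.

    Returns "before" if a happened-before b, "after" for the reverse,
    "equal" if identical, and "concurrent" if neither dominates.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

Two replicas that each accepted a write during a partition produce "concurrent" clocks, which is exactly the case a CRDT or a merge policy has to handle.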
Kafka and RabbitMQ solve different problems. Kafka is a distributed log — durable, replayable, high throughput, good for streams and CDC. RabbitMQ is a message broker — flexible routing, lower throughput, good for task queues. For AI workloads: Kafka for ingestion & evaluation streams, RabbitMQ (or SQS) for inference task queues.
| | Kafka | RabbitMQ |
|---|---|---|
| Model | Distributed log (consumer pull) | Broker (push, routing) |
| Ordering | Per-partition | Per-queue (with caveats) |
| Throughput | 100k–1M+ msg/s | 10k–50k msg/s |
| Retention | Days/weeks/forever | Until consumed |
| Replay | Yes, native | No (dead-letter workaround) |
| Best fit | Event streaming, CDC, analytics, audit logs | Task queues, work distribution, RPC-ish |
DNS: recursive resolvers, authoritative, TTL, DNS-based load balancing (GeoDNS), anycast. CDNs: edge vs origin, cache-control, origin shield, signed URLs, Workers/edge functions. For AI: edge inference is becoming real (Cloudflare Workers AI, Vercel AI SDK on edge). Know it.
REST vs gRPC vs GraphQL vs tRPC. Know when gRPC wins: internal service-to-service, streaming, strict typing, polyglot. Protobuf schema evolution. Interceptors, deadlines, metadata. For AI: server-sent events (SSE) and gRPC streams for token streaming; Model Context Protocol (MCP) is emerging as the common standard for exposing tools to models.
| Criterion | REST | gRPC |
|---|---|---|
| Transport | HTTP/1.1 or /2, JSON | HTTP/2, Protobuf (binary) |
| Typing | OpenAPI (optional) | Strict, via .proto |
| Streaming | SSE or WebSockets | Native bi-di streams |
| Browser | First class | Needs gRPC-Web proxy |
| Best fit | Public APIs, browser clients | Internal microservices, low-latency |
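For token streaming over SSE, the wire format is worth knowing: each event is one or more `field: value` lines followed by a blank line. The generator below sketches that framing. The JSON payload shape and the `[DONE]` sentinel are conventions assumed here for illustration (the sentinel mirrors a common API style), not part of the SSE format itself.

```python
import json


def sse_events(tokens):
    """Yield SSE-framed strings, one event per generated token."""
    for i, token in enumerate(tokens):
        # JSON-encoding the payload keeps newlines inside a token from
        # breaking the line-oriented SSE framing.
        yield f"id: {i}\ndata: {json.dumps({'token': token})}\n\n"
    # Assumed end-of-stream marker so clients know generation finished.
    yield "data: [DONE]\n\n"
```

A web framework would send these chunks with the `text/event-stream` content type and flush after each one.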
You said "tweak the plan around AI" — this entire section is the tweak. GenAI/LLM system design is now a standalone interview category at OpenAI, Anthropic, Google, Meta, and every startup hiring AI engineers. Three sub-topics: LLM serving, RAG, and agents.
Know the landscape: vLLM (open-source, PagedAttention, highest throughput), TensorRT-LLM (NVIDIA, best perf on their hardware), Hugging Face TGI (ecosystem integration), SGLang (structured generation, prefix caching), llama.cpp / Ollama (local & edge). Key concepts: continuous batching, PagedAttention, speculative decoding, tensor parallelism, prefill/decode disaggregation.
End-to-end pipeline: parse → chunk → embed → store → retrieve → rerank → prompt → generate. For each stage, know 2–3 options and their trade-offs. Chunking strategy is where most RAG systems die — semantic chunking beats fixed-size for most document types but costs more. Always implement hybrid search (BM25 + vector) + reranking.
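As a baseline for the chunking stage, here is a sketch of fixed-size chunking with overlap, the strategy that semantic chunking is usually compared against. It splits on whitespace and counts words, which is a simplifying assumption; production pipelines usually count model tokens instead.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the previous chunk so context is not cut mid-thought."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap trades storage and embedding cost for recall: a sentence straddling a boundary appears whole in at least one chunk.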
An agent is an LLM with an execution loop, tools, memory, and guardrails. The LLM is ~20% of the system — the rest is infrastructure: orchestrator (LangGraph, custom), tool registry, sandbox, policy engine, observability. This is squarely in Lookover's wheelhouse — and your Claude dossier on the LangGraph automation stack reflects exactly the right mental model.
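The "execution loop, tools, guardrails" framing can be sketched without any framework. In the example below, `llm` is a hypothetical callable that takes the history and returns either `{"tool": ..., "args": ...}` or `{"final": ...}`; that contract, the allow-list check, and the step budget are assumptions for illustration, not LangGraph's API.

```python
def run_agent(llm, tools, task, max_steps=5):
    """Minimal agent loop: ask the model, run the chosen tool, repeat."""
    history = [("user", task)]          # short-term memory
    for _ in range(max_steps):          # guardrail: bounded step budget
        action = llm(history)
        if "final" in action:
            return action["final"]
        name = action["tool"]
        if name not in tools:           # guardrail: tool allow-list
            history.append(("tool", f"error: unknown tool {name}"))
            continue
        result = tools[name](**action["args"])
        history.append(("tool", str(result)))
    return "stopped: step budget exhausted"
```

Nearly everything an interviewer will probe (sandboxing, policy checks, tracing, idempotent retries) attaches to a specific line of this loop, which makes it a useful skeleton to draw first.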
Aim for ~10 hours a week: 5 hours reading/watching, 3 hours hands-on building, 2 hours mock interviewing (out loud, whiteboard on paper). If 10h/week is too much, stretch to 12 weeks — don't compress below 8.
Drill these over weeks 4–8. Pick one, give yourself 45 minutes, whiteboard it, talk it out loud (record yourself if possible), then compare against a reference answer. The AI-specific ones marked ⌃ are the highest-yield for the roles you're targeting.