Two months of research and engineering to design, build, and evaluate a production‑grade, event‑driven LLM inference platform on Kubernetes with WebSocket token streaming, Redis messaging, and autoscaling GPU workers.
We present a cloud‑native architecture that decouples request ingress, inference execution, and token delivery to sustain high concurrency with predictable tail latency. The system combines a FastAPI gateway, Redis for job buffering and token fan‑out, dedicated GPU workers, and Kubernetes HPA driven by CPU and GPU metrics. We evaluate throughput, p95 latency, and cost per 1k tokens under synthetic and trace‑driven loads, and document failure handling and recovery characteristics.
Event‑driven pipeline with Redis queue + Pub/Sub separates latency‑sensitive streaming from GPU‑bound compute. Stateless gateways and workers enable independent scaling.
Kubernetes deployments with HPA; GPU workers with resource requests/limits; WebSocket token streaming; structured logging and request IDs end‑to‑end.
Load generation with step, burst, and trace profiles; measurements of p50/p95/p99 latency, throughput, GPU utilization, and cost.
Graceful drain, idempotent workers, retry semantics, dead‑letter queue option, health probes, and backpressure signals.
Namespace isolation, per‑service ServiceAccounts, least‑privilege RBAC, Secrets for tokens, network policies for intra‑cluster traffic.
Single‑command bootstrap, infrastructure as code, manifests and scripts, and seed data for benchmarking.
A distributed, asynchronous design that separates ingress, messaging, compute, and streaming; a minimal end‑to‑end code sketch follows the component descriptions below.
FastAPI, request admission, job creation, and WebSocket streaming. Stateless for horizontal scale.
Queue buffers spikes. Pub/Sub fans out tokens in real time. Optional dead‑letter queue for failures.
Pull jobs, run inference, stream tokens. Resource requests/limits enforce scheduling on GPU nodes.
HPA watches metrics and adjusts worker replicas to track demand.
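Concretely, the flow can be sketched in a few dozen lines. This is an illustrative sketch, not the project's exact code: the route paths, the Redis list and channel names (jobs, jobs:dead, tokens:<job_id>), the <eos> sentinel, and run_inference are placeholder assumptions. The gateway admits a request, buffers it on the queue, and relays tokens from Pub/Sub over the WebSocket; the worker pops jobs, streams tokens as it generates them, and retries or dead‑letters failures.

# Minimal sketch of the event-driven flow; names and routes are illustrative assumptions.
import json
import os
import uuid

import redis                          # redis-py sync client (worker side)
import redis.asyncio as aioredis      # redis-py async client (gateway side)
from fastapi import FastAPI, WebSocket

REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379/0")

# ---- Gateway (FastAPI; one of N stateless replicas) ----
app = FastAPI()
rdb = aioredis.from_url(REDIS_URL, decode_responses=True)

@app.post("/generate")
async def submit(body: dict):
    """Admit a request: create a job, buffer it on the Redis list, return its id."""
    job_id = str(uuid.uuid4())
    job = {"id": job_id, "prompt": body.get("prompt", ""), "retries": 0}
    await rdb.lpush("jobs", json.dumps(job))
    return {"job_id": job_id}

@app.websocket("/ws/{job_id}")
async def stream(ws: WebSocket, job_id: str):
    """Relay tokens published by a worker on the job's Pub/Sub channel.
    Pub/Sub is fire-and-forget: tokens published before the client subscribes
    are dropped, so clients should open the socket before (or right after) submitting."""
    await ws.accept()
    pubsub = rdb.pubsub()
    await pubsub.subscribe(f"tokens:{job_id}")
    try:
        async for msg in pubsub.listen():
            if msg["type"] != "message":
                continue                    # skip subscribe confirmations
            if msg["data"] == "<eos>":      # worker signals end of stream
                break
            await ws.send_text(msg["data"])
    finally:
        await pubsub.unsubscribe(f"tokens:{job_id}")
        await ws.close()

# ---- GPU worker (separate container; its entrypoint runs worker_loop) ----
MAX_RETRIES = 3

def run_inference(prompt: str):
    """Placeholder for the real model call; yields generated tokens."""
    yield from prompt.split()

def worker_loop():
    wdb = redis.from_url(REDIS_URL, decode_responses=True)
    while True:
        _, raw = wdb.brpop("jobs")          # blocking pop: the queue absorbs spikes
        job = json.loads(raw)
        try:
            for token in run_inference(job["prompt"]):
                wdb.publish(f"tokens:{job['id']}", token)
            wdb.publish(f"tokens:{job['id']}", "<eos>")
        except Exception:
            job["retries"] += 1             # retry a few times, then dead-letter
            target = "jobs" if job["retries"] <= MAX_RETRIES else "jobs:dead"
            wdb.lpush(target, json.dumps(job))

The Deployment and HorizontalPodAutoscaler manifests below provision these GPU workers and scale them against CPU and GPU utilization targets.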
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-worker-deployment
spec:
  replicas: 2
  selector:
    matchLabels: { app: llama-worker }
  template:
    metadata:
      labels: { app: llama-worker }
    spec:
      containers:
        - name: llama-worker
          image: your-registry/llama-inference-worker:latest
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
      nodeSelector:
        nvidia.com/gpu.present: "true"
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-worker-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # GPU utilization is not reported by metrics-server; this metric assumes a
    # custom metrics pipeline (e.g., a GPU exporter plus a metrics adapter)
    # that exposes utilization for this resource name.
    - type: Resource
      resource:
        name: nvidia.com/gpu
        target:
          type: Utilization
          averageUtilization: 85
We measure latency, throughput, GPU utilization, and cost under multiple load profiles.
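As one concrete slice of that methodology, a step‑profile generator of the following shape can drive the gateway and report latency percentiles and throughput. The /generate route, payload, and request rates are assumptions for the sketch; it times request admission at the gateway, and trace replay follows the same pattern with a recorded arrival schedule.

# Step-profile load generator sketch; endpoint and rates are illustrative assumptions.
import asyncio
import statistics
import time

import httpx

GATEWAY = "http://localhost:8000"   # assumed gateway address (see the quickstart below)

async def one_request(client: httpx.AsyncClient, latencies: list) -> None:
    t0 = time.perf_counter()
    await client.post(f"{GATEWAY}/generate", json={"prompt": "hello"})
    latencies.append(time.perf_counter() - t0)

async def run_step_profile(rps_steps=(5, 10, 20), step_seconds=30):
    """Hold each request rate for step_seconds; return latencies and wall-clock time."""
    latencies, tasks = [], []
    start = time.monotonic()
    async with httpx.AsyncClient(timeout=60) as client:
        for rps in rps_steps:
            stop = time.monotonic() + step_seconds
            while time.monotonic() < stop:
                tasks.append(asyncio.create_task(one_request(client, latencies)))
                await asyncio.sleep(1 / rps)
        await asyncio.gather(*tasks)    # drain in-flight requests before closing the client
    return latencies, time.monotonic() - start

def report(latencies: list, elapsed: float) -> dict:
    cuts = statistics.quantiles(latencies, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98],
            "throughput_rps": len(latencies) / elapsed}

if __name__ == "__main__":
    lats, elapsed = asyncio.run(run_step_profile())
    print(report(lats, elapsed))

On the security side, services run under per‑service ServiceAccounts bound to least‑privilege Roles, for example: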
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
# Injected at the container level of the pod spec:
envFrom:
  - configMapRef: { name: llama-config }
  - secretRef: { name: llama-secrets }
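Inside the container, the worker reads these injected values from its environment; the variable names below are illustrative assumptions about what llama-config and llama-secrets carry, not the project's exact keys.

import os

REDIS_URL = os.environ["REDIS_URL"]                         # from llama-config; fail fast if absent
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/llama")  # from llama-config
HF_TOKEN = os.environ.get("HF_TOKEN")                       # from llama-secrets; never logged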
We estimate cost per 1k tokens as a function of GPU seconds, gateway CPU seconds, and Redis I/O ops. Autoscaling reduces idle GPU minutes; queue‑aware batching increases the useful work extracted from each GPU second.
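As a worked example, the estimate can be expressed as a small function. Every unit price below is a placeholder assumption rather than a measured or quoted rate; the structure of the calculation, not the numbers, is the point.

# Illustrative cost-per-1k-tokens model; all prices are placeholder assumptions.
GPU_PRICE_PER_HOUR = 1.50            # assumed USD per GPU-hour
CPU_PRICE_PER_HOUR = 0.05            # assumed USD per vCPU-hour
REDIS_PRICE_PER_MILLION_OPS = 0.20   # assumed USD per 1M Redis ops

def cost_per_1k_tokens(gpu_seconds: float, cpu_seconds: float,
                       redis_ops: int, tokens_generated: int) -> float:
    """Spend attributed to a workload, normalized to 1,000 generated tokens."""
    gpu_cost = gpu_seconds / 3600 * GPU_PRICE_PER_HOUR
    cpu_cost = cpu_seconds / 3600 * CPU_PRICE_PER_HOUR
    redis_cost = redis_ops / 1_000_000 * REDIS_PRICE_PER_MILLION_OPS
    return (gpu_cost + cpu_cost + redis_cost) / tokens_generated * 1000

# Example: 120 GPU-seconds, 30 gateway CPU-seconds, 50k Redis ops, 40k tokens generated
print(cost_per_1k_tokens(120, 30, 50_000, 40_000))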
# 1) Deploy core services
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/redis.yaml
kubectl apply -f k8s/gateway.yaml
kubectl apply -f k8s/worker-gpu.yaml
# 2) Enable autoscaling
kubectl apply -f k8s/worker-hpa.yaml
# 3) Port-forward or expose gateway
kubectl port-forward svc/llama-api-gateway 8000:80
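# 4) Observe autoscaling, resource usage, and events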
kubectl top pods -n llm # requires metrics-server
kubectl get hpa -n llm
kubectl describe hpa llama-worker-hpa -n llm
kubectl get events --sort-by=.lastTimestamp -n llm
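With the gateway port‑forwarded (step 3), a small client can confirm end‑to‑end token streaming. The /generate and /ws/{job_id} routes mirror the earlier sketch and are assumptions about the API surface, as are the httpx and websockets client libraries.

# Smoke-test sketch: submit one job and print tokens as they stream back.
import asyncio

import httpx
import websockets   # third-party 'websockets' client

async def main() -> None:
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8000/generate",
                                 json={"prompt": "Explain HPA in one sentence."})
        job_id = resp.json()["job_id"]
    async with websockets.connect(f"ws://localhost:8000/ws/{job_id}") as ws:
        async for token in ws:          # iterate until the gateway closes the stream
            print(token, end="", flush=True)
    print()

asyncio.run(main())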
Key studies and technical works on Kubernetes-based, autoscaled LLM serving.
Introduces a hierarchical autoscaling mechanism based on SLO-aware backpressure that improves GPU efficiency significantly.
Presents a deployment, monitoring, and autoscaling framework for stable multi-GPU LLM serving with improved cost and QoS.
Explores containerized, autoscaled architectures that dynamically adapt to workload fluctuations in real LLM inference workloads.
Describes a modular, cache-aware, routing-optimized serving framework built on Kubernetes, vLLM, and inference gateways.
Explains how serverless, scale-to-zero inference can be achieved on Kubernetes using KFServing and Knative.