Two months of research and engineering to design, build, and evaluate a production‑grade, event‑driven LLM inference platform on Kubernetes with WebSocket token streaming, Redis messaging, and autoscaling GPU workers.
We present a cloud‑native architecture that decouples request ingress, inference execution, and token delivery to sustain high concurrency with predictable tail latency. The system combines a FastAPI gateway, Redis for job buffering and token fan‑out, dedicated GPU workers, and Kubernetes HPA driven by CPU and GPU metrics. We evaluate throughput, p95 latency, and cost per 1k tokens under synthetic and trace‑driven loads, and document failure handling and recovery characteristics.
Event‑driven pipeline with Redis queue + Pub/Sub separates latency‑sensitive streaming from GPU‑bound compute. Stateless gateways and workers enable independent scaling.
Kubernetes deployments with HPA; GPU workers with resource requests/limits; WebSocket token streaming; structured logging and request IDs end‑to‑end.
Load generation with step, burst, and trace profiles; measurements of p50/p95/p99 latency, throughput, GPU utilization, and cost.
Graceful drain, idempotent workers, retry semantics, dead‑letter queue option, health probes, and backpressure signals.
Namespace isolation, per‑service ServiceAccounts, least‑privilege RBAC, Secrets for tokens, network policies for intra‑cluster traffic.
Single‑command bootstrap, infrastructure as code, manifests and scripts, and seed data for benchmarking.
A distributed, asynchronous design that separates ingress, messaging, compute, and streaming; a minimal end‑to‑end code sketch follows the component descriptions below.
FastAPI, request admission, job creation, and WebSocket streaming. Stateless for horizontal scale.
Queue buffers spikes. Pub/Sub fans out tokens in real time. Optional dead‑letter queue for failures.
Pull jobs, run inference, stream tokens. Resource requests/limits enforce scheduling on GPU nodes.
HPA watches metrics and adjusts worker replicas to track demand.
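Concretely, the flow can be sketched in a few dozen lines. This is an illustrative sketch, not the project's exact code: the route paths, the Redis list and channel names (jobs, jobs:dead, tokens:<job_id>), the <eos> sentinel, and run_inference are placeholder assumptions. The gateway admits a request, buffers it on the queue, and relays tokens from Pub/Sub over the WebSocket; the worker pops jobs, streams tokens as it generates them, and retries or dead‑letters failures.

# Minimal sketch of the event-driven flow; names and routes are illustrative assumptions.
import json
import os
import uuid

import redis                          # redis-py sync client (worker side)
import redis.asyncio as aioredis      # redis-py async client (gateway side)
from fastapi import FastAPI, WebSocket

REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379/0")

# ---- Gateway (FastAPI; one of N stateless replicas) ----
app = FastAPI()
rdb = aioredis.from_url(REDIS_URL, decode_responses=True)

@app.post("/generate")
async def submit(body: dict):
    """Admit a request: create a job, buffer it on the Redis list, return its id."""
    job_id = str(uuid.uuid4())
    job = {"id": job_id, "prompt": body.get("prompt", ""), "retries": 0}
    await rdb.lpush("jobs", json.dumps(job))
    return {"job_id": job_id}

@app.websocket("/ws/{job_id}")
async def stream(ws: WebSocket, job_id: str):
    """Relay tokens published by a worker on the job's Pub/Sub channel.
    Pub/Sub is fire-and-forget: tokens published before the client subscribes
    are dropped, so clients should open the socket before (or right after) submitting."""
    await ws.accept()
    pubsub = rdb.pubsub()
    await pubsub.subscribe(f"tokens:{job_id}")
    try:
        async for msg in pubsub.listen():
            if msg["type"] != "message":
                continue                    # skip subscribe confirmations
            if msg["data"] == "<eos>":      # worker signals end of stream
                break
            await ws.send_text(msg["data"])
    finally:
        await pubsub.unsubscribe(f"tokens:{job_id}")
        await ws.close()

# ---- GPU worker (separate container; its entrypoint runs worker_loop) ----
MAX_RETRIES = 3

def run_inference(prompt: str):
    """Placeholder for the real model call; yields generated tokens."""
    yield from prompt.split()

def worker_loop():
    wdb = redis.from_url(REDIS_URL, decode_responses=True)
    while True:
        _, raw = wdb.brpop("jobs")          # blocking pop: the queue absorbs spikes
        job = json.loads(raw)
        try:
            for token in run_inference(job["prompt"]):
                wdb.publish(f"tokens:{job['id']}", token)
            wdb.publish(f"tokens:{job['id']}", "<eos>")
        except Exception:
            job["retries"] += 1             # retry a few times, then dead-letter
            target = "jobs" if job["retries"] <= MAX_RETRIES else "jobs:dead"
            wdb.lpush(target, json.dumps(job))

The Deployment and HorizontalPodAutoscaler manifests below provision these GPU workers and scale them against CPU and GPU utilization targets.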
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-worker-deployment
spec:
  replicas: 2
  selector:
    matchLabels: { app: llama-worker }
  template:
    metadata:
      labels: { app: llama-worker }
    spec:
      containers:
        - name: llama-worker
          image: your-registry/llama-inference-worker:latest
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
      nodeSelector:
        nvidia.com/gpu.present: "true"
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-worker-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # GPU utilization is not reported by metrics-server; this metric assumes a
    # custom metrics pipeline (e.g., a GPU exporter plus a metrics adapter)
    # that exposes utilization for this resource name.
    - type: Resource
      resource:
        name: nvidia.com/gpu
        target:
          type: Utilization
          averageUtilization: 85
We measure latency, throughput, GPU utilization, and cost under multiple load profiles.
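As one concrete slice of that methodology, a step‑profile generator of the following shape can drive the gateway and report latency percentiles and throughput. The /generate route, payload, and request rates are assumptions for the sketch; it times request admission at the gateway, and trace replay follows the same pattern with a recorded arrival schedule.

# Step-profile load generator sketch; endpoint and rates are illustrative assumptions.
import asyncio
import statistics
import time

import httpx

GATEWAY = "http://localhost:8000"   # assumed gateway address (see the quickstart below)

async def one_request(client: httpx.AsyncClient, latencies: list) -> None:
    t0 = time.perf_counter()
    await client.post(f"{GATEWAY}/generate", json={"prompt": "hello"})
    latencies.append(time.perf_counter() - t0)

async def run_step_profile(rps_steps=(5, 10, 20), step_seconds=30):
    """Hold each request rate for step_seconds; return latencies and wall-clock time."""
    latencies, tasks = [], []
    start = time.monotonic()
    async with httpx.AsyncClient(timeout=60) as client:
        for rps in rps_steps:
            stop = time.monotonic() + step_seconds
            while time.monotonic() < stop:
                tasks.append(asyncio.create_task(one_request(client, latencies)))
                await asyncio.sleep(1 / rps)
        await asyncio.gather(*tasks)    # drain in-flight requests before closing the client
    return latencies, time.monotonic() - start

def report(latencies: list, elapsed: float) -> dict:
    cuts = statistics.quantiles(latencies, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98],
            "throughput_rps": len(latencies) / elapsed}

if __name__ == "__main__":
    lats, elapsed = asyncio.run(run_step_profile())
    print(report(lats, elapsed))

On the security side, services run under per‑service ServiceAccounts bound to least‑privilege Roles, for example: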
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
# Injected at the container level of the pod spec:
envFrom:
  - configMapRef: { name: llama-config }
  - secretRef: { name: llama-secrets }
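Inside the container, the worker reads these injected values from its environment; the variable names below are illustrative assumptions about what llama-config and llama-secrets carry, not the project's exact keys.

import os

REDIS_URL = os.environ["REDIS_URL"]                         # from llama-config; fail fast if absent
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/llama")  # from llama-config
HF_TOKEN = os.environ.get("HF_TOKEN")                       # from llama-secrets; never logged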
We estimate cost per 1k tokens as a function of GPU seconds, gateway CPU seconds, and Redis I/O ops. Autoscaling reduces idle GPU minutes; queue‑aware batching increases the useful work extracted from each GPU second.
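As a worked example, the estimate can be expressed as a small function. Every unit price below is a placeholder assumption rather than a measured or quoted rate; the structure of the calculation, not the numbers, is the point.

# Illustrative cost-per-1k-tokens model; all prices are placeholder assumptions.
GPU_PRICE_PER_HOUR = 1.50            # assumed USD per GPU-hour
CPU_PRICE_PER_HOUR = 0.05            # assumed USD per vCPU-hour
REDIS_PRICE_PER_MILLION_OPS = 0.20   # assumed USD per 1M Redis ops

def cost_per_1k_tokens(gpu_seconds: float, cpu_seconds: float,
                       redis_ops: int, tokens_generated: int) -> float:
    """Spend attributed to a workload, normalized to 1,000 generated tokens."""
    gpu_cost = gpu_seconds / 3600 * GPU_PRICE_PER_HOUR
    cpu_cost = cpu_seconds / 3600 * CPU_PRICE_PER_HOUR
    redis_cost = redis_ops / 1_000_000 * REDIS_PRICE_PER_MILLION_OPS
    return (gpu_cost + cpu_cost + redis_cost) / tokens_generated * 1000

# Example: 120 GPU-seconds, 30 gateway CPU-seconds, 50k Redis ops, 40k tokens generated
print(cost_per_1k_tokens(120, 30, 50_000, 40_000))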
# 1) Deploy core services
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/redis.yaml
kubectl apply -f k8s/gateway.yaml
kubectl apply -f k8s/worker-gpu.yaml
# 2) Enable autoscaling
kubectl apply -f k8s/worker-hpa.yaml
# 3) Port-forward or expose gateway
kubectl port-forward svc/llama-api-gateway 8000:80
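# 4) Observe autoscaling, resource usage, and events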
kubectl top pods -n llm # requires metrics-server
kubectl get hpa -n llm
kubectl describe hpa llama-worker-hpa -n llm
kubectl get events --sort-by=.lastTimestamp -n llm
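With the gateway port‑forwarded (step 3), a small client can confirm end‑to‑end token streaming. The /generate and /ws/{job_id} routes mirror the earlier sketch and are assumptions about the API surface, as are the httpx and websockets client libraries.

# Smoke-test sketch: submit one job and print tokens as they stream back.
import asyncio

import httpx
import websockets   # third-party 'websockets' client

async def main() -> None:
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8000/generate",
                                 json={"prompt": "Explain HPA in one sentence."})
        job_id = resp.json()["job_id"]
    async with websockets.connect(f"ws://localhost:8000/ws/{job_id}") as ws:
        async for token in ws:          # iterate until the gateway closes the stream
            print(token, end="", flush=True)
    print()

asyncio.run(main())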
Key studies and technical works on Kubernetes-based, autoscaled LLM serving.
Introduces a hierarchical autoscaling mechanism based on SLO-aware backpressure that improves GPU efficiency significantly.
Presents a deployment, monitoring, and autoscaling framework for stable multi-GPU LLM serving with improved cost and QoS.
Explores containerized, autoscaled architectures that dynamically adapt to workload fluctuations in real LLM inference workloads.
Describes a modular, cache-aware, routing-optimized serving framework built on Kubernetes, vLLM, and inference gateways.
Explains how serverless, scale-to-zero inference can be achieved on Kubernetes using KFServing and Knative.