AI Inference on Kubernetes: A Production Guide

A practical guide to running AI inference on Kubernetes at scale.

We're a team of ML and infra engineers who spent 2 years building production infrastructure for AI workloads - solving cold starts, GPU scheduling, and LLM-specific autoscaling. We've deployed everything from 7B parameter models to 405B monsters, handled millions of requests, and learned expensive lessons about what breaks at scale.

Now we're sharing everything: the patterns that worked, the tools we built, and the lessons we learned. All open-source.

By the end of this guide, you'll be able to:

1. Differentiate between infrastructure requirements for CPU vs GPU workloads

  • Understand why your 500MB Python API container balloons to 20GB+ for LLMs
  • Configure specialized GPU servers vs standard CPU nodes (memory requirements, network topology, storage patterns)
  • Implement KV-cache aware routing instead of round-robin load balancing
  • Handle model sharding across multiple GPUs and nodes

2. Modernize your existing K8s stack to extract peak performance for inference workloads

  • Reduce cold start times from 10+ minutes to under 30 seconds using model caching, image optimization, and lazy loading
  • Achieve 90%+ GPU utilization with proper batching, request routing, and resource allocation
  • Scale from 0 to 100+ pods in seconds with predictive autoscaling and warm pools
  • Eliminate OOM crashes with memory-aware scheduling and dynamic batching

3. Set up observability stack to monitor inference-specific metrics

  • Track per-request latency (P50/P95/P99), tokens per second, and time-to-first-token
  • Monitor hourly GPU costs and utilization across your fleet
  • Alert on KV cache pressure, batch queue depths, and model loading failures
  • Correlate inference metrics with business KPIs (user experience, cost per conversation)

Introduction: When Your Kubernetes Cluster Meets AI

If you're running traditional microservices on Kubernetes, you've likely spent years optimizing for horizontal scaling, fast container starts, and efficient resource bin-packing. Your cluster hums along nicely, scaling up during traffic spikes, packing containers efficiently onto nodes, and maintaining sub-second response times. Then someone asks you to deploy a large language model.

Suddenly, everything breaks.

Your first LLM deployment reveals a harsh reality: the assumptions that make Kubernetes excellent for microservices actively work against GPU inference workloads. That lightweight 500MB API container? It's now 20GB because model weights ship with the image. Your pods that started in seconds? They now take 10 minutes to become ready as gigabytes of weights load into GPU memory. Your elegant horizontal scaling? It falls apart when a single model requires multiple GPUs working in concert.

This guide explores why GPU workloads fundamentally differ from CPU workloads, and how to evolve your Kubernetes infrastructure to handle them effectively. We'll build understanding from first principles, starting with what makes inference unique, then systematically addressing each challenge through battle-tested patterns and tools.

The Fundamental Mismatch

Traditional Kubernetes workloads operate under predictable constraints. CPU and memory are relatively fungible resources. If a pod needs 2 CPU cores and 4GB RAM, any node with available capacity can run it. Load balancers distribute requests round-robin because one pod is as good as another. When traffic increases, you add more replicas. When it decreases, you remove them. Simple.

GPU inference shatters these assumptions. A model requiring 140GB of VRAM can't run on a 24GB GPU, no matter how many you have. Once a conversation starts on one pod, subsequent messages must route to the same pod to reuse cached attention keys and values - breaking round-robin load balancing. Cold starts aren't measured in milliseconds but minutes, making aggressive scale-down costly.

CPU Microservices vs GPU Inference Scaling

Dimension | CPU Microservices | GPU Inference
Resource Allocation | 500MB container, 2 CPU cores, 4GB memory | 20GB container, 140GB VRAM, model sharded across GPUs
Load Balancing | Round-robin works perfectly | Cache-aware sticky routing required; round-robin breaks
Startup Time | ~2 seconds | 10+ minutes

Traditional Kubernetes optimizations work against GPU inference workloads.

Let's understand why.

PART 1: Understanding AI Workloads on Kubernetes

The Architecture of Inference

Before we dive into Kubernetes specifics, we need to understand what happens during inference and why it demands such different infrastructure patterns.

1.1 The Inference Request Lifecycle

When a user sends a prompt to your LLM service, a complex dance begins. Understanding each phase reveals why generic autoscaling fails and what metrics actually matter.

The journey starts with tokenization. Your prompt "Explain quantum computing" becomes a sequence of token IDs: [18741, 31228, 37968]. This happens on CPU and is relatively fast, but for long documents, tokenization can consume non-trivial time. Some teams run separate tokenization services to offload this from GPU nodes.

Next comes the prefill phase. The model processes all input tokens in parallel, building up the initial attention matrices. This phase is compute-intensive - the GPU runs at high utilization, crunching through matrix multiplications. For a 2,000 token prompt on a 70B model, prefill might take 500-800ms. The output is the first generated token plus a set of key-value matrices that capture the context.

Then begins the decode phase, where magic happens one token at a time. The model generates "Quantum", then "computing", then "is"... sequentially. Each token generation requires accessing the key-value (KV) cache from previous tokens. This phase is memory-bandwidth bound - the GPU spends most time moving data from high-bandwidth memory (HBM) to compute cores rather than computing.

Inference Request Pipeline

  1. Tokenization - CPU-bound ("Hello" → [18741])
  2. Prefill - compute-intensive (2,000 tokens: 500-800ms)
  3. Decode - memory-bound (token-by-token generation)

GPU resource utilization by phase:

Phase | GPU | CPU / Memory Bandwidth
Tokenization | ~5% | CPU ~80%
Prefill | ~95% | Memory BW ~40%
Decode | ~30% | Memory BW ~90%

The decode phase is memory-bandwidth-bound despite showing low GPU utilization.

This split personality - compute-bound prefill versus memory-bound decode - explains why GPU utilization metrics mislead. A GPU showing 30% utilization during decode isn't underutilized; it's bottlenecked on memory bandwidth. Adding more requests won't speed up token generation.

1.2 The Hidden State: KV Cache

The KV cache is inference's hidden state machine, and understanding it unlocks performance optimization. During attention computation, the model needs to consider all previous tokens. Recomputing attention from scratch for each token would be catastrophically slow. Instead, inference engines cache the key and value matrices from each layer.

Consider what this means for a conversation. When a user sends their first message, the KV cache is empty. After the model responds, the cache contains the attention state for both prompt and response. When the user sends their second message, the model reuses this cached state, dramatically accelerating inference.

But caches aren't free. For Llama 2 70B, each token in the conversation adds approximately 2.5MB to the KV cache. A 10,000 token conversation (a lengthy technical discussion) consumes 25GB of GPU memory just for cache. This is why long-context models are so challenging - a 128K context window could theoretically consume 320GB for KV cache alone.

This creates a routing problem. If a user's second message routes to a different pod, that pod must rebuild the entire KV cache from scratch - essentially reprocessing the entire conversation history. The latency penalty is severe: what should be a 200ms response becomes 5 seconds.

KV Cache Growth Over a Conversation (Llama 2 70B)

Turn | Conversation length | KV cache | Response latency
1 | 100 tokens | 0.25GB | 200ms
2 | 500 tokens | 1.25GB | 210ms
3 | 1,500 tokens | 3.75GB | 220ms
4 | 3,000 tokens | 7.5GB | 250ms
5 | 5,000 tokens | 12.5GB | 280ms
6 (cache miss) | 5,000 tokens | 0GB - rebuild required | 5,000ms

  • Each token adds ~2.5MB to the KV cache (Llama 2 70B)
  • A 10,000 token conversation consumes 25GB of GPU memory for cache alone
  • A 128K context window could consume 320GB

Cache miss penalty: routing a turn to a different pod means rebuilding the entire conversation history, turning a 200ms response into 5,000ms (25x slower). Cache-aware routing is essential to avoid catastrophic latency penalties.

Traditional load balancers know nothing about this state. They'll happily route requests round-robin, destroying cache locality and multiplying inference costs. This is why inference workloads need cache-aware routing - a fundamental departure from stateless microservice patterns.

1.3 Why Batching Changes Everything

Batching is the secret to GPU economics. A GPU processing one request at a time achieves perhaps 10-20% utilization. Process 50 requests simultaneously, and utilization jumps to 80-90%. The same hardware, 5x the throughput.

But batching inference isn't like batching database writes. Each request in a batch might be at a different position in its generation sequence. One request might be processing its prompt (prefill), another might be 100 tokens into generation (decode), and a third might be finishing up after 500 tokens. Modern inference engines handle this through sophisticated scheduling.

The challenge is that batching increases latency. If you wait 100ms to accumulate a batch of 10 requests, you've added 100ms to the response time of the first request that arrived. This creates a fundamental tension: throughput wants large batches, latency wants immediate processing.

Dynamic Batching in LLM Inference

Single request (no batching): one request (say, at token 50/500) occupies the GPU while the rest of its capacity sits idle - roughly 10-20% utilization and low throughput.

Batched processing (vLLM continuous batching): requests at different stages run together - one in prefill on a 2,000-token prompt, others at token 100/500, 450/500, and 10/200 of decode - pushing utilization to 80-90% and roughly 5x higher throughput.

Throughput vs. latency tradeoff:

Batch size | Tradeoff
Small (e.g., 1) | Low latency, poor GPU utilization
Medium (e.g., 10) | Balanced latency and throughput
Large (e.g., 50) | High throughput, increased latency

vLLM's continuous batching allows requests to dynamically join and leave batches.

vLLM addresses this with continuous batching - requests can join and leave the batch dynamically. When one request finishes generating, another can immediately take its place. This maintains high GPU utilization without forcing requests to wait for an entire batch to complete.

1.4 Choosing Your Inference Engine

Now that we understand the mechanics of inference, we can meaningfully evaluate inference engines. Each engine makes different tradeoffs, optimizing for specific workload patterns.

vLLM: The Production Workhorse

vLLM has become the de facto standard for good reason. Its PagedAttention algorithm manages GPU memory like an operating system manages RAM - allocating memory in pages, handling fragmentation, and enabling memory sharing between requests. When multiple requests share a common prompt prefix (common in few-shot learning), vLLM stores that prefix once and shares it across requests.

The real genius of vLLM is that it "just works" for most use cases. You point it at a Hugging Face model, and it handles the complexity. It automatically determines optimal batch sizes, manages memory allocation, and provides a clean OpenAI-compatible API. For teams getting started with LLM serving, vLLM offers the shortest path to production.

The tradeoff is that vLLM prioritizes throughput over individual request latency. Its continuous batching and memory optimization work best when you have multiple concurrent requests. For single-stream, latency-critical applications, you might see better performance with more specialized solutions.

SGLang: Exploiting Patterns

SGLang takes a different approach, optimizing for workloads with structural patterns. Its RadixAttention uses a radix tree to identify and cache common prefixes across requests. When your workload has natural reuse - multi-turn conversations, RAG with shared contexts, or few-shot prompting - SGLang can achieve dramatic speedups.

Consider a customer service bot that starts every conversation with the same system prompt and retrieval context. With vLLM, each conversation processes this prefix independently. With SGLang, the first conversation caches the prefix, and subsequent conversations reuse it. The speedup compounds with prefix length - a 2,000 token shared prefix saves 500-800ms per request.

SGLang also pioneers prefill-decode disaggregation. Instead of processing prefill and decode on the same GPU, it can route prefill (compute-intensive) to one GPU type and decode (memory-bandwidth-intensive) to another. This lets you optimize hardware allocation for each phase's characteristics.

TensorRT-LLM: Maximum Performance, Maximum Complexity

NVIDIA's TensorRT-LLM compiles models into highly optimized CUDA kernels. While vLLM and SGLang interpret models at runtime, TensorRT-LLM performs ahead-of-time compilation, fusing operations and optimizing for specific GPU architectures.

The performance gains can be substantial - 20-40% better throughput and latency compared to vLLM for some models. But this comes with significant operational complexity. You must compile models for each GPU type, manage compiled artifacts, and handle version compatibility. A model compiled for H100 won't run on A100.

TensorRT-LLM makes sense when you've standardized on NVIDIA hardware, need every ounce of performance, and have engineering resources to manage the complexity. For exploratory workloads or heterogeneous GPU fleets, the operational overhead rarely justifies the performance gains.

Triton: The Orchestration Layer

Triton Inference Server occupies a different niche - it's not an inference engine but an orchestration platform. Triton can host models from multiple frameworks (TensorRT, ONNX, PyTorch, vLLM) behind a unified API. You might run your LLM on vLLM, your embedding model on ONNX Runtime, and your classification model on TensorRT, all served through Triton.

This flexibility makes Triton valuable for complex inference pipelines. A single request might flow through multiple models - content moderation, embedding generation, retrieval, and finally LLM generation. Triton handles the orchestration, batching each stage independently for optimal throughput.

1.5 Kubernetes Deployment Considerations

Deploying these engines on Kubernetes introduces additional complexity. Each engine has specific requirements that must be addressed for successful operation.

All engines need shared memory for multi-GPU communication. The default Docker shared memory size (64MB) is woefully inadequate. You'll need to mount /dev/shm as a volume with sufficient capacity - typically 10-20% of GPU memory. This enables efficient tensor passing between GPU processes.
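
A minimal sketch of how this is typically wired up - a memory-backed emptyDir mounted at /dev/shm (the pod name, image, and 16Gi size are illustrative assumptions; size the volume to your GPU memory):

apiVersion: v1
kind: Pod
metadata:
  name: vllm-worker                    # hypothetical name
spec:
  containers:
    - name: inference
      image: vllm/vllm-openai:latest   # assumes the public vLLM OpenAI-compatible image
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory                 # tmpfs-backed shared memory instead of Docker's 64MB default
        sizeLimit: 16Gi                # illustrative; roughly 10-20% of GPU memory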

Model storage strategy dramatically impacts startup times. Downloading a 140GB model from Hugging Face on every pod start is untenable. Instead, pre-cache models in persistent volumes. When pods start, they mount the volume and load from local storage. This reduces startup time from 10+ minutes to 30-60 seconds.
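
One way to set this up (a sketch; the PVC name and storage class are assumptions): a PersistentVolumeClaim pre-populated with model weights, mounted read-only by inference pods so startup only involves loading weights, not downloading them.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache            # hypothetical PVC pre-populated with model weights
spec:
  accessModes: ["ReadOnlyMany"]
  storageClassName: fast-ssd   # assumption: an SSD-backed storage class in your cluster
  resources:
    requests:
      storage: 200Gi
---
# In the inference pod spec:
#   volumeMounts:
#     - name: models
#       mountPath: /models
#       readOnly: true
#   volumes:
#     - name: models
#       persistentVolumeClaim:
#         claimName: model-cache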

Health checks require special consideration. The default Kubernetes probe timeouts assume applications start quickly. For inference workloads, initial model loading can take minutes. Configure startup probes with generous timeouts (5-10 minutes) and use readiness probes that verify the model is loaded and responsive.
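
A hedged example of probe settings along these lines (the /health path and port 8000 assume an OpenAI-compatible server such as vLLM; tune thresholds to your model's actual load time):

containers:
  - name: inference
    # ...
    startupProbe:
      httpGet:
        path: /health          # assumes the server exposes a health endpoint
        port: 8000
      periodSeconds: 10
      failureThreshold: 60     # 60 x 10s = up to 10 minutes for model loading
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 15
      timeoutSeconds: 5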

For multi-node tensor-parallel deployments, pod coordination becomes critical. A model sharded across 8 GPUs needs all 8 pods running before any can serve traffic. The LeaderWorkerSet pattern addresses this - one pod acts as coordinator, waiting for all workers before marking the service ready.
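
A sketch of the LeaderWorkerSet shape (from the sigs.k8s.io/lws project; the field names follow its v1 API, but treat the exact schema and values here as assumptions to verify against the project docs). One leader plus its workers are created and scaled as a single unit, so a tensor-parallel group comes up together:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-70b                  # hypothetical
spec:
  replicas: 2                      # two independent model replicas (groups)
  leaderWorkerTemplate:
    size: 4                        # pods per group: 1 leader + 3 workers
    leaderTemplate:
      spec:
        containers:
          - name: leader
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: 2
    workerTemplate:
      spec:
        containers:
          - name: worker
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: 2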

2. Jobs

Jobs are an integral part of AI/ML workloads on Kubernetes, used for distributed training runs, model finetuning, batch inference, and generative media pipelines. For distributed training across multiple GPUs or nodes, Jobs form the orchestration layer that coordinates worker pods, each running its own training process while communicating via frameworks like PyTorch Distributed Data Parallel (DDP) or similar protocols.

2.1 Overview

AI/ML Jobs encompass the breadth of machine learning workloads that have distinct start and end points: batch inference, model evaluation, hyperparameter tuning, data preprocessing, and finetuning.

Batch Inference

Batch inference is perhaps the most common use case: it processes large datasets asynchronously. A run reads millions of observations, fetches a trained model from storage, runs predictions on the entire batch, and writes results to a data warehouse, object store, or downstream system. Jobs make these tasks cost-efficient because the GPUs stay busy processing continuous streams of data instead of waiting for individual prediction requests.

They can also be scaled horizontally by splitting data into shards and deploying multiple job pods in parallel, each processing its subset and writing results independently. Kubernetes' Indexed Jobs are particularly effective here; each pod receives a unique JOB_COMPLETION_INDEX environment variable, allowing deterministic shard assignment without external coordination logic.
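
A minimal sketch of the pattern (the image and shard count are placeholders): completionMode: Indexed gives each pod a JOB_COMPLETION_INDEX it can use to pick its shard.

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference                # hypothetical
spec:
  completionMode: Indexed
  completions: 8                       # 8 shards in total
  parallelism: 8                       # run all shards at once
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/batch-infer:latest   # placeholder image
          command: ["sh", "-c", "python predict.py --shard $JOB_COMPLETION_INDEX --num-shards 8"]
          resources:
            limits:
              nvidia.com/gpu: 1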

Best Practices

Failure Handling and Pod Failure Policy

Real jobs fail for reasons both retriable (transient network issues) and non-retriable (bad input data, code bugs). Pod Failure Policy distinguishes between them, enabling intelligent retry behavior. With a policy like the sketch after this list, rules can express:

  • Ignore: DisruptionTarget failures (Spot instance preemptions) don't count against the retry limit
  • Count: Exit code 1 (e.g., a transient network timeout) starts a new pod and counts against backoffLimit, without failing the Job outright
  • FailJob: Exit code 42 immediately fails the entire Job; no retry
  • FailIndex: For Indexed Jobs, marks only this index as failed (other indexes continue)
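
A hedged sketch of such a policy (exit codes 1 and 42 mirror the examples above; adapt the rules to the codes your containers actually emit):

apiVersion: batch/v1
kind: Job
metadata:
  name: resilient-batch                # hypothetical
spec:
  backoffLimit: 4
  podFailurePolicy:
    rules:
      - action: Ignore                 # Spot preemptions don't count against backoffLimit
        onPodConditions:
          - type: DisruptionTarget
      - action: Count                  # transient failures retry, counting against backoffLimit
        onExitCodes:
          containerName: worker
          operator: In
          values: [1]
      - action: FailJob                # unrecoverable application error: stop retrying
        onExitCodes:
          containerName: worker
          operator: In
          values: [42]
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/batch-infer:latest   # placeholder
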
Timeout and Long-Running Jobs

GPU jobs are lengthy. Training runs and batch inference on large datasets might take hours. Configure timeouts to prevent hanging jobs from consuming resources forever. activeDeadlineSeconds can be used to set these timeouts. If the Job hasn't completed after the configured duration, all running pods are forcibly terminated and the Job is marked Failed.
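
For example, a minimal fragment (the 6-hour figure is an arbitrary illustration):

spec:
  activeDeadlineSeconds: 21600   # fail the Job if it is still running after 6 hours
  backoffLimit: 3                # independent of the deadline: max pod-failure retries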

CronJobs for Scheduled Batch Work

CronJobs trigger Jobs on a schedule, useful for recurring finetuning (daily model updates), batch inference (weekly reports), or cleanup tasks. Set concurrencyPolicy to Forbid to prevent overlapping job runs: if a finetuning job runs long, the next scheduled run is skipped rather than started alongside it.
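
A sketch of a nightly finetuning trigger (the schedule, image, and GPU count are placeholders):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-finetune             # hypothetical
spec:
  schedule: "0 2 * * *"              # 02:00 every day
  concurrencyPolicy: Forbid          # skip a new run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: finetune
              image: registry.example.com/finetune:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 4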

Gang Scheduling for Distributed Jobs

Distributed training requires all workers to start simultaneously. If only some pods launch while others wait, the launched pods consume GPU capacity without making progress. You can use KubeRay which handles gang scheduling natively.

[Diagram: DAG workflow showing chained ML jobs - preprocessing → training → evaluation → deployment]

2.2 Finetuning

Finetuning adapts a pre-trained model to specific tasks or domains by training on task-specific data. Modern finetuning is parameter-efficient: only a small fraction of the model's parameters are trained while the base model remains frozen, drastically reducing computation and memory requirements compared to full model training.

2.2.1 Axolotl

Axolotl is a modern finetuning framework optimized for LLMs. It abstracts away the complexity of distributed training, mixed precision, and parameter-efficient techniques (LoRA, QLoRA), providing a simple YAML configuration interface.

Why Axolotl
  • Configuration-driven: Define training entirely in YAML; no Python scripting required
  • Multi-GPU support: Seamless DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) integration via Accelerate
  • Production-ready: Checkpoint management, validation, early stopping built-in
  • Memory-efficient: Native QLoRA support (4-bit quantization + LoRA) enables training on smaller GPUs
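
A hedged sketch of what such a config looks like (the key names follow common Axolotl examples, but treat specific fields and values as assumptions to verify against the Axolotl docs):

# qlora-70b.yaml - illustrative Axolotl config
base_model: meta-llama/Llama-2-70b-hf
load_in_4bit: true             # QLoRA: 4-bit quantized base model
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
datasets:
  - path: ./data/train.jsonl   # placeholder dataset
    type: alpaca
val_set_size: 0.05
sequence_len: 4096
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0002
bf16: true
gradient_checkpointing: true
output_dir: /tmp/output        # matches the adapter layout shown in 2.2.2

The same file drives single- and multi-GPU runs; which distributed strategy applies (DDP, FSDP, QLoRA + FSDP) is chosen through the config plus the Accelerate launch settings.
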
Multi-GPU Training Strategies

Axolotl supports multiple distributed training paradigms:

  • Distributed Data Parallelism (DDP): Full model on each GPU; data split across GPUs. Simplest approach, works for models fitting in single GPU memory.
  • Fully Sharded Data Parallelism (FSDP): Model, optimizer, and gradients sharded across GPUs. Enables training models larger than single GPU memory. For a 70B model across 8 A100s, each GPU holds 70B/8 ≈ 8.75B parameters.
  • QLoRA with FSDP: 4-bit quantize the base model, apply LoRA, and shard across GPUs. Enables training 70B+ models on modest hardware.

2.2.2 LoRA Management

Low-Rank Adaptation (LoRA) trains small, task-specific weight matrices instead of full model weights. A 70B model fine-tuned with LoRA produces an adapter measured in megabytes rather than gigabytes - typically a few MB to a few hundred MB depending on rank and target modules, versus 140GB for the full FP16 model - enabling efficient storage, versioning, and serving of multiple task-specific variants.

/tmp/output/
├── adapter_config.json      # LoRA configuration
├── adapter_model.bin        # LoRA weights (MBs, not GBs)
├── training_args.bin
└── checkpoint-500/
    ├── adapter_model.bin
    └── optimizer.pt

2.3 Image/Video Workloads

Generative image and video models (Stable Diffusion, Flux, AnimateDiff) are compute-intensive and suited to batch processing. Unlike real-time LLM inference, image generation operates asynchronously: users submit prompts, receive job IDs, and poll for completion or receive notifications when rendering finishes.

2.3.1 ComfyUI: Node-Based Workflows

ComfyUI represents pipelines as node graphs. Key operations include loading models, applying text conditioning, running diffusion sampling, and saving outputs. Workflows are defined as JSON DAGs, enabling parameterization and orchestration. Custom nodes extend functionality for ControlNet, upscaling, LoRA composition, and other specialized tasks.

[Screenshot: ComfyUI node-based interface showing a complete image generation workflow]

PART 2: Understanding the Infrastructure

3. Hardware

3.1 Choosing the Right Machine for the Job

GPU selection is a critical architecture decision: it directly influences cost per inference, application latency, overall system throughput, and the operational complexity of your deployment. The choice comes down to three technical factors - memory bandwidth, memory capacity, and cost efficiency - plus the service-level agreements your business has committed to.

3.1.1 Calculating the Model's Memory Requirements

For LLM inference, the basic formula is:

Total GPU Memory = Model Weights + KV Cache + Activation Memory + Input/Output Buffers + Overhead
1. Model Weights

These make up the lion's share of the memory required to run inference for that model. Quantization can reduce a model's memory requirements by 50-87.5%.

Model Memory = Number of Parameters × Bytes per Parameter

By precision:

  • FP32: Parameters × 4 bytes
  • FP16/BF16: Parameters × 2 bytes
  • INT8: Parameters × 1 byte
  • INT4: Parameters × 0.5 bytes
2. KV Cache

The KV cache is critical for transformers and, after the model weights, is the largest memory consumer during generation.

KV Cache = 2 × Batch Size × Max Sequence Length × Number of Layers × Hidden Dimension × Bytes per Parameter

The "2" comes into the equation because both Keys (K) and Values (V) need to be cached.

Examples:
GPT-3 175B (FP16, Batch size = 1, Max len = 2048 tokens)
  • Model weights: 175B × 2 = 350 GB
  • KV Cache: 2 × 1 × 2048 × 96 × 12288 × 2 = ~9 GB
  • Activations + Overhead: ~2GB
  • Total: ~361 GB
  • Needs 5× A100 80GB or model parallelism
Llama 2 7B (FP16, Batch size = 1, Max len = 2048 tokens)
  • Model weights: 7B × 2 bytes = 14 GB
  • KV Cache: ~1 GB
  • Activations + Overhead: ~1.5 GB
  • Total: ~16.5 GB
  • Fits on A10G (24GB)

3.1.2 Key GPU Specifications and Their Impact

Understanding GPU architecture is essential for making informed decisions:

  • Memory Capacity: Sets hard limits on model size and batch capacity. Quantization lets you fit larger models on the same GPU by reducing the memory requirement, at the cost of some model accuracy.
  • Memory Bandwidth: The primary constraint for LLM inference. H100 (3.35 TB/s) vs. L40S (864 GB/s) represents a 3.8x difference in memory throughput.
  • Tensor Core Architecture: H100's 4th-gen Tensor Cores support native FP8 operations, delivering 2-4x compute advantages over A100.
  • NVLink Interconnect Bandwidth: Critical for multi-GPU deployments. H100 HGX systems provide 900 GB/s per-GPU NVLink bandwidth vs. PCIe's 32 GB/s - a 28x difference.

3.1.3 Single vs. Multi-GPU Trade-offs

The Single Large GPU Approach (1x H100)

This configuration minimizes latency by keeping the entire model on a single device. All memory access happens over ultra-fast HBM, eliminating inter-GPU communication overhead. For latency-critical applications, this approach delivers the lowest possible Time To First Token (TTFT) and most consistent performance.

The Multi-GPU Approach (2x L40S)

This configuration introduces complexity through tensor parallelism, where the model is split across GPUs and requires inter-GPU communication for every forward pass. However, this approach may offer substantial cost advantages as smaller GPUs typically cost significantly less per unit.

Example

Consider deploying a model requiring 80GB of VRAM:

  • 1x H100 (80GB): Minimizes latency, eliminates inter-GPU communication. Best for: latency-critical services (TTFT <200ms required).
  • 2x L40S (48GB each): Tensor parallelism introduces communication latency. Cost 40-60% lower than H100 but with higher TPOT. Best for: batch processing where throughput/$ matters.

3.1.4 How Service-Level Objectives Drive Hardware Decisions

Real-Time Interactive Applications (Ultra-Low Latency SLO)

Applications like chatbots, coding assistants, or real-time conversational AI require immediate response to maintain user engagement. For these workloads, Time To First Token (TTFT) under 200ms and consistent per-token latency are critical. This pushes you towards single, powerful GPUs with maximum memory bandwidth.

Batch Processing Workloads (High Throughput SLO)

For applications like document summarization, content generation, or batch data analysis where individual request latency is less critical, the optimization target shifts to maximum throughput per dollar. Here, multi-GPU configurations may provide better economics.

3.2 Provisioning GPUs

3.2.1 GPU Node Pools: Managing Heterogeneous GPU Infrastructure

In production Kubernetes environments, you'll inevitably operate a heterogeneous mix of GPU hardware. Different models require different GPU configurations, and cost optimization often demands using the most appropriate hardware for each specific workload. Node Pools provide the organizational structure to manage this complexity effectively.

A node pool represents a group of nodes that share identical configuration - machine type, disk configuration, GPU type, and other specifications. As your workloads grow in number and variety, so does the number of node pools you have to manage.

Karpenter: Dynamic GPU Node Provisioning

Traditional Kubernetes autoscaling (Cluster Autoscaler) operates at the node-group level, so every instance shape you might ever need has to be defined up front. Karpenter removes this constraint by shifting from static node groups to dynamic requirements: instead of predefined node groups, you define abstract NodePool requirements, and Karpenter automatically provisions the cheapest instance that satisfies them.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ['g4dn', 'g5', 'g5g', 'g6', 'g6e', 'p2', 'p3', 'p3dn', 'p4d', 'p5']
        - key: kubernetes.io/os
          operator: In
          values: ['linux']
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-v1-ec2nodeclass
Node Autoprovisioning: Eliminating Manual Node Pool Management

When a pod remains unschedulable due to insufficient resources, the autoprovisioner analyzes the pod's resource requests and scheduling constraints and dynamically selects the optimal instance type from hundreds of options. It then creates a new node pool with that instance type and provisions a node.

Feature | Node Autoprovisioning (Cloud-Managed) | Karpenter (Open-Source)
Availability | Available on GKE, AKS; AWS lacks a direct equivalent | Currently AWS only; GCP and Azure support in development
Management | Cloud provider handles driver installation, taints, labels | Requires custom configuration and provider setup
Best For | Teams wanting simplicity and cloud-provider-managed infrastructure | Advanced consolidation strategies, fine-tuned cost optimization

3.2.2 NVIDIA Device Plugin and GPU Operator

NVIDIA Device Plugin

Natively, Kubernetes is unaware of any GPUs that might be present on any given node. The NVIDIA Device Plugin is the essential component that bridges this gap. It's a DaemonSet that runs on every node, discovers the available GPUs, and advertises them as a schedulable resource to the Kubernetes API server. This allows you to request GPUs in your pod specifications as nvidia.com/gpu: 1.

NVIDIA GPU Operator

The NVIDIA GPU Operator addresses complexity by automating the entire lifecycle of GPU-related software components. Implemented as a Kubernetes Operator, it manages:

  • Driver Installation: Automatically installs and manages NVIDIA drivers compatible with the specific GPU hardware and kernel version
  • NVIDIA Container Toolkit: Configures the container runtime to expose GPU devices to containers
  • Monitoring Stack: Deploys DCGM Exporter for comprehensive GPU telemetry metrics
  • GPU Feature Discovery: Automatically labels nodes with GPU characteristics
  • Multi-Instance GPU (MIG): Manages MIG configuration and partitioning

3.2.3 GPU Scheduling and Allocation

Resource Requests and Limits

GPU resources are specified using the extended resource name nvidia.com/gpu in pod resource specifications. Unlike CPU and memory, GPU resources follow special constraints - GPUs are specified in the limits section only (if requests are set, they must equal limits), and they cannot be overcommitted.

Node Selectors and Tolerations

Node Selectors provide simple label-based scheduling constraints:

spec:
  nodeSelector:
    karpenter.k8s.aws/instance-family: g5

Taints and Tolerations enforce more granular control, preventing non-GPU workloads from consuming expensive hardware.
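
A combined sketch (the taint key, pod name, and image are illustrative): GPU nodes carry a taint, and only pods that both tolerate it and request nvidia.com/gpu in their limits land there.

# Taint applied to GPU nodes (via node pool config or manually):
#   kubectl taint nodes <node> nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference              # hypothetical
spec:
  nodeSelector:
    karpenter.k8s.aws/instance-family: g5
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1        # GPUs go in limits; requests, if set, must equal limits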

GPU Sharing Strategies
Multi-Instance GPU (MIG)

MIG is hardware-level GPU partitioning available on GPUs starting with the NVIDIA Ampere generation. It divides a single GPU into up to 7 isolated instances, each with dedicated compute, memory, and cache resources. Standard A100-80GB profiles include: 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb.

When to use MIG:

  • Multi-tenant inference where workload isolation is critical
  • Strict fairness requirements
  • Models sized for MIG slices (e.g., 10GB models)
Time-Slicing

Time-slicing is software-based GPU oversubscription using context-switching. The Device Plugin exposes a single GPU as multiple virtual GPUs; the NVIDIA driver schedules processes on the GPU in round-robin fashion.

When to use time-slicing:

  • Low-utilization batch inference workloads (<25% duty cycle)
  • Development and testing environments
  • Situations where MIG is unavailable (older GPU generations)
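
The device plugin is configured for time-slicing through a config along these lines (the replica count of 4 is illustrative, and how the config is wired into the plugin - for example via the GPU Operator's device plugin configuration - depends on your deployment, so treat this as a sketch):

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4          # each physical GPU is advertised as 4 schedulable GPUs
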
Multi-Process Service (MPS)

MPS enables concurrent execution of multiple GPU processes through a shared server architecture. Unlike time-slicing (which context-switches work), MPS allows kernel and memory copy operations from different processes to overlap on the GPU, achieving true concurrent execution.

When to use MPS:

  • When strict isolation is not required (trusted workloads)
  • High-frequency inference workloads requiring kernel-level concurrency benefits
  • Situations where GPU memory bandwidth is the bottleneck
Factor | MIG | MPS | Time-Slicing
Isolation | Full (hardware) | Weak (memory only) | None
Concurrency | Isolated | Concurrent kernels | Context-switched
Performance overhead | None | 7-8% for LLMs | 10-15% per virtual GPU
Use case | Multi-tenant production | High-frequency inference | Low-utilization workloads

[Diagram: Visual comparison of MIG, MPS, and Time-Slicing architectures showing isolation levels and resource sharing]

3.3 GPU Access Patterns

How you procure GPU capacity from cloud providers significantly impacts both cost and operational complexity. Modern cloud providers offer diverse purchasing models:

Instance Type | Pros | Cons | Ideal Use Case
On-Demand | Flexibility, no commitment | Premium pricing | Development, testing, unpredictable workloads
Reserved | Up to 72% discount, predictable costs | Requires upfront commitment | Real-time inference with long-term predictable usage
Spot | Up to 90% discount | Can be interrupted with 2-minute notice | Fault-tolerant training jobs, non-critical workloads
Capacity Blocks | Reserve capacity for specific time windows | Less flexible, minimum reservation durations | Planned training runs, scheduled batch jobs

4. Autoscaling

Autoscaling in Kubernetes adapts compute resources dynamically to workload demands. For AI workloads, autoscaling becomes fundamentally different from traditional microservices. LLM inference is GPU-bound, long-running, and exhibits highly variable latency.

Effective autoscaling for LLM workloads requires orchestration at two levels: pod-level scaling (adjusting replica counts based on concurrency and inference metrics) and node-level scaling (provisioning GPU capacity just-in-time).

4.1 Pod Level Autoscaling

4.1.1 Horizontal Pod Autoscaler (HPA)

The Kubernetes Horizontal Pod Autoscaler (HPA) is the native scaling mechanism for adjusting pod replica counts. HPA continuously monitors metrics, compares them to targets, and calculates the desired replica count.

Why CPU and Memory Scaling Fail

Decode-phase inference (generating tokens one-by-one) is memory-bandwidth-bound and shows low CPU utilization even under high concurrency. Memory utilization is similarly misleading; many GPUs can hold high batch sizes in memory but generate tokens slowly, appearing underutilized while requests queue up.

LLM workloads therefore need scaling driven by custom, inference-aware metrics.

4.1.2 Custom Metrics for LLM Autoscaling

Request Rate Based Scaling

Request rate or Requests per Second (RPS) measures the number of inference requests per second across your pod fleet. However, it doesn't account for request complexity. Use RPS when your workload has uniform request characteristics; otherwise, prioritize queue depth or token-based metrics.

Queue Depth Scaling

Queue depth measures how many units of work are waiting to be processed. This metric provides the most reliable scaling signal because it directly represents unmet demand. The relationship between queue depth and latency is nearly linear.

For inference, the queue exists inside the LLM server (vLLM, TensorRT-LLM, TGI). For jobs, the queue is external (Redis, RabbitMQ, SQS, Kafka).

Current Batch Size

This metric measures how many inference requests are actively being processed by the GPU at this moment. Current batch size correlates tightly with latency percentiles; when batch size approaches your model's configured maximum, the GPU is saturated and additional requests must wait in queue.

Tokens Per Second

Tokens per second (TPS) measures the throughput of your inference cluster. By setting a target TPS per pod, you instruct the HPA to provision replicas based on total token demand. This creates a direct, predictable relationship: more throughput demand equals more pods.

However, TPS is a throughput metric and does not inherently guarantee latency. A high TPS target encourages aggressive batching, which can increase latency. TPS is best used in a multi-metric autoscaling strategy combined with queue depth.

Request Latency (p95, p99)

Request latency is the time from submission to result delivery. However, latency is not suitable as a direct HPA scaling target because it's a lagging indicator. Instead, use latency as a tuning parameter for your primary scaling metric.

KV Cache Usage

The KV cache stores precomputed Key and Value matrices for all previously generated tokens. For long-context workloads (100K+ tokens), the KV cache can consume 10-30GB of GPU memory per request, becoming a significant memory constraint. KV cache usage indicates how much GPU memory remains for batch processing.

Combining queue depth and KV cache targets provides defense-in-depth: queue depth protects latency SLOs; KV cache usage protects against thrashing due to memory pressure.
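
A hedged sketch of that combination as an HPA (it assumes vLLM metrics such as vllm:num_requests_waiting and vllm:gpu_cache_usage_perc are exposed to the HPA through an adapter like prometheus-adapter or KEDA; the exported metric names and targets below are assumptions to adapt):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa                             # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting    # per-pod queue depth, as exposed via the adapter
        target:
          type: AverageValue
          averageValue: "5"                  # scale out when more than 5 requests queue per pod
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_cache_usage_perc    # KV cache utilization (0-1)
        target:
          type: AverageValue
          averageValue: "800m"               # roughly 80% KV cache usage

Latency SLOs then enter as tuning parameters: if p95 drifts upward, tighten the queue-depth target rather than scaling on latency directly.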

GPU Utilization

GPU utilization measures the percentage of time the GPU's compute cores are active. However, GPU utilization and latency are not tightly coupled. Use GPU utilization as a safety valve (scale up if it exceeds 90%) and as a leading indicator for cost optimization, but not as a primary scaling target.

4.2 Node Level Autoscaling

4.2.1 The Ideal Autoscaler

The characteristics that a node autoscaler must have to be optimal for AI/ML workloads are:

  • Immediate scaling response: Provisions nodes for pending pods within seconds
  • Just-in-time provisioning: Brings up nodes only when required
  • Multi-resource awareness: Considers CPU, memory, GPU type/count, and custom requirements
  • Flexible instance integration: Seamlessly provisions On-Demand, Spot, Reserved Instances, and Capacity Blocks
  • Bin-packing efficiency: Consolidates workloads onto fewer nodes
  • Fast, safe consolidation: Aggressively removes underutilized nodes without disrupting services

4.2.2 Kubernetes Cluster Autoscaler vs Karpenter

Feature | Cluster Autoscaler | Karpenter
Scaling Speed | 1-3+ minutes from pod creation to node ready | <1 minute in most cases
Instance Type Selection | Constrained by pre-defined ASG instance types | Dynamically evaluates all compatible instance types
Bin-Packing Efficiency | Scales individual ASGs independently | Performs global bin-packing across all instance types
Cost Optimization | Reactive only; scales up/down based on utilization | Proactive; continuously evaluates consolidation opportunities
Best For | Simple, homogeneous workloads with predictable patterns | Complex GPU workloads with diverse requirements

Karpenter Dynamic Node Provisioning Architecture

Provisioning workflow:

  1. Unschedulable pod: a pending pod requests, for example, 1x A100-80GB and 256GB of memory.
  2. Karpenter controller: watches the Kubernetes API in real time, detects the unschedulable pod instantly, and analyzes its resource requirements.
  3. Instance selection: global bin-packing across all compatible instance types (e.g., p4d.24xlarge with 8x A100, g5.48xlarge with 8x A10G, p5.48xlarge); the cheapest option that satisfies the requirements is selected.
  4. Node provisioned: the node is ready in under a minute and the pod is scheduled.

Key components:

  • Kubernetes API Server: the central control plane; tracks pod scheduling status, reports unschedulable pods, registers new nodes
  • Karpenter: the dynamic provisioner, made up of the Controller (watches for pending pods), NodePool (defines instance requirements), and NodeClass (cloud-specific configuration)
  • Cloud Provider API (AWS, GCP, Azure): provisions EC2/GCE/VM instances, configures networking and IAM, handles instance lifecycle

Karpenter vs Cluster Autoscaler performance:

Metric | Cluster Autoscaler | Karpenter
Provision time | 1-3+ minutes (~180s) | <1 minute (~45s)
Scheduling signal | Polls every 10 seconds | Watches the API in real time
Instance selection | Constrained by pre-defined ASGs | Dynamic, across all compatible types
Bin-packing | Per-ASG | Global

The net effect: roughly 3-4x faster node provisioning, 20-40% cost savings from consolidation, and no ASG management. Karpenter's real-time provisioning eliminates the polling delay of traditional autoscalers.

4.2.3 Autoscaling Best Practices

Right-Sizing GPU Nodes

GPU node provisioning decisions have enormous cost implications. Right-sizing requires understanding your model's resource footprint:

  1. Model VRAM: How much GPU memory does your model occupy at inference time
  2. KV cache headroom: Reserve memory for KV cache growth with long contexts
  3. Batch processing: Determine target batch size based on throughput/latency requirements

Example: Deploying Llama 3 70B (fp16)

  • Model weights: ~140GB
  • KV cache (single request, 128K context): ~25GB
  • Batch size 4 total KV cache: 4 × 25GB = ~100GB
  • Total: ~250GB required
  • A single H100 (80GB HBM) cannot fit this workload; even an H200 (141GB) falls short. You need distributed inference, e.g., tensor parallelism across 4x H100s or 2x H200s.
Cost Optimization Through Capacity Type Selection

Strategic use of different capacity types—Spot, On-Demand, Reserved Instances, and Capacity Blocks—can reduce expenses by 40-70% while maintaining reliability.

Handling Node Interruptions

PodDisruptionBudgets prevent simultaneous eviction of too many replicas, ensuring at least 1 inference pod is always available. Appropriate Termination Grace Periods allow requests to complete before forced termination.
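
For example (the selector label and grace period are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb              # hypothetical
spec:
  minAvailable: 1                  # never evict the last serving replica
  selector:
    matchLabels:
      app: llm-inference
---
# In the inference pod spec, give in-flight requests time to drain:
#   terminationGracePeriodSeconds: 120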

PART 3: Managing the Infrastructure

5. Storage

Efficient storage management is critical for AI workloads in Kubernetes. Unlike traditional applications, AI systems must handle exceptionally large artifacts: container images (often 10+ GB), model files (gigabytes to hundreds of gigabytes), and massive datasets.

5.1 Image Storage

Container images for AI workloads are typically large, often exceeding 10 GB. They bundle frameworks like TensorFlow and PyTorch along with system dependencies such as CUDA for GPU acceleration. The base CUDA image alone is 4 GB.

Lazy Loading

Lazy loading defers file downloads until the container actually needs them, drastically reducing start times. Popular open-source options include Nydus, SOCI, and eStargz.

However, the real-world speedup for AI workloads is less dramatic. AI services typically require a couple of minutes to start regardless—they must verify modules, download models, load them into GPU memory, and compile them for execution. In practice, lazy loading provides a 1.25x-3x speedup.

[Graph: Comparison of container startup times with different snapshotters showing overlayfs baseline vs. Nydus, SOCI, and eStargz performance]

5.2 Model Storage

AI models can be very large, so how you store and access your models has a huge impact on performance:

  1. Embedded in Container Image: Most straightforward but bloats images. Only feasible for small models (<5GB) that don't change frequently.
  2. Object Storage (S3, GCS, Azure Blob): Scalable and cost-effective but suffers from increased cold start times. Good for cold storage and backup.
  3. Network File System (NFS): Fast but expensive. Good when multiple pods need access to the same model. Can become a single point of failure.
  4. Shared Volume with Cache: Models stored centrally in object storage, with a DaemonSet pre-warming a local cache on fast disk (NVMe SSD) on each node. Inference pods mount this local cache for near-instant access.
  5. HuggingFace Xet: Content-defined chunking breaks files into small, deduplicated pieces. Only changed chunks are transferred. Use hf-transfer library for high-speed parallel downloads.

5.3 Dataset Storage

Training workloads demand fast, sustained access to datasets. Storage options include:

1. Local NVMe Storage

NVMe drives attached directly to compute instances offer the highest performance—up to 10+ GB/s read/write speeds and sub-millisecond latency. NVMe can deliver 20x better performance and 10x lower latency compared to network block storage. Best suited for temporary storage needs like intermediate checkpoints and cache layers.

2. Object Storage (S3, GCS, Azure Blob)

Scalable, durable storage with tiered pricing models. Significantly slower than local NVMe (up to 20x slower) but cost-effective for raw datasets and long-term archival. Should be combined with caching strategies for active training workloads.

3. Network File Systems (NFS/EFS/FSx)

Managed network file systems provide shared persistent storage accessible across multiple pods. Data survives pod restarts and node failures. Significantly slower than local NVMe but suitable for shared datasets that need concurrent access.

The most performant approach combines multiple storage tiers: object storage as the primary tier (complete dataset as source of truth), local NVMe on each node as high-speed cache, and a DaemonSet handling cache population and consistency.

InfiniBand for Distributed Training

When training across multiple nodes, inter-GPU communication becomes a bottleneck. InfiniBand provides ultra-low latency (sub-microsecond) and high bandwidth (200-400 Gbps) networking specifically designed for HPC and distributed training workloads. For large-scale distributed training, InfiniBand can reduce training time by 30-50%.

6. Networking and Routing

6.1 Gateway Configuration

Traditional Kubernetes load balancing treats all backends as equivalent; round-robin across pods works for stateless web apps but fails for LLM inference. Inference workloads are long running, maintain state in GPU memory (KV cache), and have vastly different resource requirements per request.

The Gateway API Inference Extension addresses this by making gateways inference-aware. It introduces two Custom Resources: InferenceModel (defines logical model endpoints) and InferencePool (groups pods serving models). Instead of blind round-robin, the gateway queries backend metrics such as queue depth, KV cache state and GPU utilization and routes intelligently.

6.2 Load Balancing Strategies

KV Cache-Aware Routing

When requests share prefixes (conversation history, RAG contexts, few-shot examples), routing them to pods with cached content avoids recomputation. llm-d achieves 87% cache hit rates with 88% faster Time To First Token for warm cache hits versus cold.

Queue Depth Balancing

Rather than distributing requests evenly, the gateway sends requests to pods with the shortest queues. Since LLM inference duration varies wildly, queue-based balancing keeps GPUs busy without overloading any single pod.

Criticality-Based Routing

Mark requests with priority levels; the gateway sends high-priority requests to dedicated pods or puts them at the front of queues, ensuring chatbot users get fast responses while background summarization jobs can wait.

6.3 Model Version and Traffic Management

Safely rolling out new models requires traffic splitting. Use InferenceModel to define logical endpoints that split traffic across multiple backend versions: 90% to stable llama-v1, 10% to canary llama-v2. This enables A/B testing different models or fine-tuned adapters without service disruption.

LoRA Adapter Routing becomes critical when serving multiple fine-tuned versions. Route requests to pods with the requested adapter already loaded in GPU memory, eliminating dynamic loading latency.

For disaggregated serving (separate prefill and decode pods), the gateway routes initial prompts to prefill servers and subsequent token generation to decode servers, optimizing resource allocation.

6.4 Implementation Considerations

Deploy the Gateway API Inference Extension through Helm charts that configure the gateway and External Processing Pod (EPP), the component that queries backend metrics and makes routing decisions. Monitoring is essential—export metrics like cache hit rates, queue depths per pod, and latency distributions.

Load Balancing Strategies Comparison

  • Round-robin (traditional; breaks for LLMs): the gateway distributes requests evenly across pods, destroying cache locality; the latency penalty turns 200ms responses into 5s ones.
  • Queue depth aware: the gateway sends each request to the pod with the shortest queue (e.g., 2 vs. 8 vs. 6 waiting requests), preventing hot spots and balancing GPU utilization.
  • KV cache aware: the gateway checks which pod already holds the relevant cache and routes there, achieving 87% cache hit rates and 88% faster TTFT on warm hits.

Performance impact:

Strategy | Cache hit rate | Avg TTFT
Round-robin | ~0% | 5,000ms
Queue depth aware | - (balance score 95%) | 300ms
KV cache aware | 87% | 200ms

Intelligent routing strategies dramatically improve performance for LLM workloads.

7. Observability

Observability is critical for maintaining reliable, performant, and cost-effective AI workloads on Kubernetes. Whether you're running inference services or finetuning jobs, comprehensive observability enables you to understand system behavior, troubleshoot issues quickly, and optimize resource utilization.

7.1 Metrics

For AI systems running on Kubernetes, you need to track three distinct categories of metrics: resource metrics, request metrics, and business metrics.

7.1.1 Resource Metrics

What to Track:
  • GPU Utilization: Percentage of time GPUs are actively processing workloads
  • GPU Memory Usage: Framebuffer usage (total, used, and free)
  • GPU Temperature and Power: Thermal metrics and power consumption
  • Tensor Core Utilization: How efficiently specialized compute units are used
  • PCIe and NVLink Throughput: Data transfer rates between CPUs/GPUs
  • CPU and Memory: Traditional resource metrics for non-GPU components
What to Use:

NVIDIA DCGM (Data Center GPU Manager) is the industry-standard tool for monitoring NVIDIA GPUs in Kubernetes environments. DCGM provides low-overhead telemetry collection with support for fine-grained metrics. It integrates seamlessly with Kubernetes through the DCGM Exporter, which exposes GPU metrics in Prometheus format with per-pod, per-container labels.

7.1.2 Request Metrics

What to Track:
  • Request Rate: Requests per second
  • Latency Metrics: End-to-end latency including percentiles (p50, p95, p99). For LLMs, separate time-to-first-token (TTFT) from total generation time
  • Token Metrics: Input tokens, output tokens, and total tokens processed per request
  • Error Rates: Failed requests, timeout errors, and model errors
  • Throughput: Tokens per second or batches per second
  • Queue Depth: Pending requests in serving queues
  • Model-Specific Metrics: Model name, version, temperature parameters
What to Use:

Prometheus is the de facto standard for metrics collection in Kubernetes environments. For AI-specific instrumentation, use OpenTelemetry with OpenLLMetry extensions. OpenLLMetry automatically captures prompt text, completion tokens, model parameters, token usage breakdown, and latency per operation.

7.1.3 Business Metrics

What to Track:
  • Cost Metrics: Cost per inference request, per 1M tokens, GPU idle cost, total cost of ownership
  • Quality Metrics: Hallucination rates, output quality scores, user satisfaction scores
  • Operational Metrics: Time to value for model deployments, model retraining frequency
  • Usage Metrics: Active users, session counts, feature adoption rates, API consumption patterns
Calculating Costs

Capture GPU costs using your cloud provider's billing API or tools like Kubecost to get the hourly cost of GPU instances. For request-based pricing, instrument your application to emit request count metrics using Prometheus counters.

For token-based pricing, track both input and output tokens. Most inference frameworks expose token counts directly. Calculate:

Wasted GPU Cost = Total GPU Cost × (1 - GPU Utilization)
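
As a concrete sketch, the same formula can be expressed as a Prometheus recording rule (DCGM_FI_DEV_GPU_UTIL is the DCGM Exporter's utilization gauge in percent; the 4.10 hourly price is a placeholder to replace with your rate or a billing/Kubecost metric):

groups:
  - name: gpu-cost
    rules:
      - record: gpu:wasted_cost_dollars_per_hour:sum
        # (1 - utilization) x hourly price, summed across all GPUs in the fleet
        expr: sum((1 - (DCGM_FI_DEV_GPU_UTIL / 100)) * 4.10)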

7.2 The DCGM Exporter

The DCGM Exporter bridges NVIDIA's Data Center GPU Manager with Kubernetes-native monitoring stacks. It runs as a DaemonSet on GPU nodes, using DCGM's Go bindings to collect telemetry and expose it via an HTTP /metrics endpoint in Prometheus format.

Key Features:
  • Per-Pod GPU Metrics: Identifies which GPUs are assigned to which pods for cost allocation and troubleshooting
  • Configurable Metrics: Customize which GPU metrics to collect using CSV configuration files
  • Profiling Metrics: Access fine-grained utilization metrics including SM occupancy, Tensor Core activity
  • MIG Support: Monitors Multi-Instance GPU partitions separately

7.3 Prometheus

Prometheus serves as the metrics aggregation and storage layer. Its pull-based architecture, powerful query language (PromQL), and native Kubernetes integration make it the standard choice for cloud-native observability.

Configure Prometheus to scrape DCGM Exporter endpoints for GPU metrics, application metrics for inference servers, and kube-state-metrics for Kubernetes object state. Set appropriate scrape intervals—typically 15-30 seconds for production systems.
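
A hedged example of the scrape side (if you run the Prometheus Operator, a ServiceMonitor is the more idiomatic route; this plain scrape_config sketch assumes the DCGM Exporter is exposed through a Service named dcgm-exporter on its default port 9400):

scrape_configs:
  - job_name: dcgm-exporter
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: dcgm-exporter          # keep only the exporter's endpoints
        action: keep
# Inference-server and kube-state-metrics endpoints are scraped with analogous jobs.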

7.4 Tracing

Distributed tracing is essential for understanding complex AI workflows where a single user request may traverse multiple services, model calls, retrieval operations, and external APIs.

Why Tracing Matters for AI Workloads:

A RAG-based LLM application might involve: user query received → query embedding generation → vector database retrieval → context aggregation → prompt construction → LLM inference call → response postprocessing → output validation. Each step may execute in different services, making it impossible to reconstruct the full request path without distributed tracing.

What Tracing Captures:
  • Spans: Individual operations with start time, duration, and metadata including prompt text, model name, token counts, and error details
  • Traces: End-to-end request journeys composed of parent-child span relationships
  • Context Propagation: Trace IDs and span IDs propagated across service boundaries
Tracing Tools:

Common open-source tracing tools include Jaeger, Zipkin, and OpenTelemetry. OpenTelemetry has an AI-specific extension called OpenLLMetry which handles prompt/completion logging, token usage tracking, model parameter capture, and cost attribution automatically.

7.5 Visualization

7.5.1 Grafana

Grafana is the leading open-source visualization platform for Kubernetes observability, offering rich integrations with Prometheus, Loki, Jaeger, and other data sources. It supports multi-source dashboards, combining metrics from multiple data sources.

AI-Specific Dashboards

Modern Grafana deployments for AI workloads should include:

  • GPU utilization and efficiency metrics per workload
  • Inference latency percentiles and throughput
  • Model performance metrics (accuracy, drift indicators)
  • Cost dashboards showing GPU spend by team, project, or model
  • Token usage and API consumption trends
Pre-Built Dashboards

The community provides extensive dashboard templates including:

  • NVIDIA DCGM dashboards for GPU utilization, memory, temperature, and power consumption
  • Kubernetes cluster resource dashboards for CPU, memory, network, and storage
  • Deployment and pod-level views with drill-down capabilities

[Screenshot: Example Grafana dashboard showing GPU metrics including utilization, memory usage, temperature, and power consumption across multiple nodes]

Best Practices:

Organize dashboards hierarchically: cluster overview → node details → workload specifics. Use consistent color schemes and units across dashboards for easier interpretation. Leverage Grafana's folder and tagging system to manage dashboard sprawl.

8. Managing Node Failures

Because GPU nodes are expensive, handling unhealthy nodes efficiently is critical to keep cloud bills from skyrocketing. This requires strategies to detect node failures, to remediate them, and to fall back gracefully when automated remediation fails.

8.1 Detection

There are a few primary ways to detect node failures and unhealthy nodes:

1. Kubernetes Health Checks

Each node runs the kubelet, which registers the node with the control plane and sends it a heartbeat every 10 seconds. The kubelet reports various node conditions:

  • MemoryPressure: True if node is running out of memory
  • DiskPressure: True if node is running out of disk space
  • PIDPressure: True if too many processes are running on the node
  • Ready: True if node is healthy and ready to accept pods
  • NetworkUnavailable: True if the network for the node is not correctly configured
2. Node Problem Detector

Node Problem Detector is a daemon for monitoring and reporting about a node's health. It collects information about node problems from various daemons and reports these conditions to the API server as Node Conditions for permanent issues or as Events for temporary problems.

Installation:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/node-problem-detector/master/deployment/node-problem-detector.yaml

What it monitors:

  • Kernel issues (deadlocks, corrupted filesystem)
  • Hardware problems (bad CPU, memory, disk)
  • Docker/containerd daemon issues
  • NTP service failures

8.2 Remediation

Automatic Remediation: Node Auto-Repair

Cloud-managed clusters and autoscalers often automatically replace unhealthy nodes. Karpenter, for example, automatically terminates nodes when they have unhealthy status conditions based on the cloud provider's repair policies.

Manual Remediation

A dedicated ops team can remediate nodes manually by:

  1. Cordoning the node to prevent new pods: kubectl cordon <node>
  2. Draining the node to evict pods: kubectl drain <node> --ignore-daemonsets
  3. Debugging or replacing the node
  4. Uncordoning it when the issue is fixed: kubectl uncordon <node>

8.3 Fallbacks

It is important to have fallbacks in place in case your automated remediation methods fail. Karpenter, for example, won't delete unhealthy nodes if they make up more than 20% of the NodePool, because widespread unhealthiness is more likely to signal a cluster- or deployment-level problem than isolated node failures.

In situations like these, set CloudWatch Alarms (or equivalent) that trigger an SNS notification and a Lambda function with custom logic to remediate the situation.

Node Failure Detection and Remediation Flow

  1. Kubelet heartbeat monitoring: the kubelet sends a heartbeat every 10 seconds; missing heartbeats flip the node from Ready to NotReady/Unknown.
  2. Node Problem Detector: surfaces deeper issues (kernel problems, hardware failures, container runtime problems, NTP drift) as Node Conditions or Events.
  3. Karpenter automatic remediation: terminates nodes with unhealthy status conditions and provisions replacements, with a safety valve - it won't act if more than 20% of the NodePool is unhealthy.
  4. Fallback, if auto-repair fails or is blocked: a CloudWatch alarm triggers an SNS notification and a Lambda function with custom remediation logic.

When manual intervention is needed, follow the cordon → drain → fix/replace → uncordon sequence described above. This multi-layered approach ensures reliable failure detection and remediation.

Key Takeaways

Running AI inference on Kubernetes at scale requires fundamentally rethinking traditional infrastructure patterns. The assumptions that make Kubernetes excellent for microservices—rapid container starts, round-robin load balancing, CPU-based autoscaling—actively work against GPU inference workloads.

Success comes from understanding the unique characteristics of inference workloads:

  • Match hardware to workload characteristics: Memory bandwidth drives token generation speed. Single GPUs minimize latency; multi-GPU setups optimize cost. Service-level objectives determine the optimal configuration.
  • Use inference-aware autoscaling metrics: Queue depth provides the most reliable scaling signal. Combine with KV cache usage and token throughput for defense-in-depth. CPU and memory utilization mislead for GPU workloads.
  • Implement multi-tier storage architecture: Object storage for source of truth, local NVMe for high-speed cache, DaemonSets for cache management. Model storage strategy dramatically impacts cold start times.
  • Build comprehensive observability from the start: DCGM + Prometheus + Grafana for metrics. OpenTelemetry with OpenLLMetry for tracing. Track resource, request, and business metrics. Calculate cost per token, not just infrastructure spend.
  • Deploy intelligent routing: KV cache-aware routing avoids recomputation penalties. Queue depth balancing prevents hot spots. Criticality-based routing ensures latency SLOs for interactive requests.

The patterns and tools covered in this guide represent battle-tested approaches from production deployments serving millions of requests. Start with vLLM for inference, Karpenter for autoscaling, and DCGM for observability. Build from there as your specific requirements emerge.

GPU infrastructure is expensive, but running it inefficiently is more expensive. The investment in understanding these patterns pays dividends in reduced costs, improved reliability, and better user experience.