
Why Stateless Failover: An Engineering Deep Dive

January 21, 2026 • by Bobby Mathews

TL;DR: We evaluated OS-level container checkpointing (CRIU) for warm migration on GPUs. We rejected it because CRIU simply cannot serialize GPU memory, and the application-level alternative costs far more to build than retrying failed HTTP requests ever will. The result: a stateless system that recovers 6x faster and is economically superior (our ROI analysis shows it would take 1,826 years to break even on checkpoint infrastructure).

The Problem: Spot Instances Need Fault Tolerance

Spot instances cost 70-80% less than on-demand, but they come with a catch: the cloud provider can reclaim them with just two minutes' notice. When that happens, your running inference workload has to move to a new instance.
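The two-minute warning itself arrives through the instance metadata service. Here is a minimal polling sketch, assuming an AWS-style spot interruption endpoint (other clouds expose an equivalent signal); the endpoint returns 404 until a reclaim is actually scheduled:

import time
import requests

# AWS-style spot interruption endpoint; 404 means no reclaim is scheduled yet.
INTERRUPTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_for_preemption_notice(poll_interval_s: float = 5.0) -> dict:
    """Block until the cloud schedules a reclaim, then return the notice payload."""
    while True:
        try:
            resp = requests.get(INTERRUPTION_URL, timeout=2)
            if resp.status_code == 200:
                return resp.json()  # e.g. {"action": "terminate", "time": "..."}
        except requests.RequestException:
            pass  # metadata service hiccup; keep polling
        time.sleep(poll_interval_s)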

The question is: how do we preserve state during that move?

Option 1: Container Checkpointing (What We Investigated)

Docker offers a built-in checkpoint feature:

docker checkpoint create my_container my_checkpoint
docker start --checkpoint my_checkpoint my_container

This seems perfect: save the entire container state to disk, transfer to new instance, restore and continue. No recomputation, no lost tokens.

There's just one problem: it doesn't work with GPUs.

Why Docker Checkpoint Fails on GPUs

Docker checkpoint relies on CRIU (Checkpoint/Restore In Userspace), which can only snapshot memory accessible to the Linux kernel.

Component     | Can CRIU Save It? | Why?
CPU registers | ✓ Yes             | Exposed by the kernel
System RAM    | ✓ Yes             | Ordinary process memory pages
GPU VRAM      | ✗ No              | Separate from system RAM, not accessible to CRIU
CUDA contexts | ✗ No              | NVIDIA driver state, not exposed to the kernel
TPU HBM       | ✗ No              | TPU matrix unit state is opaque to the OS

When an LLM inference engine runs, the model weights and KV cache reside in GPU memory. CRIU cannot see or save that memory—it's behind the NVIDIA/TPU driver, not exposed to the OS.

Attempting docker checkpoint create on a GPU container either fails immediately, hangs indefinitely, or creates a "checkpoint" that cannot be restored.

Option 2: Application-Level Checkpointing (What SpotServe Did)

The SpotServe research paper (Miao et al., ASPLOS'24) describes successful checkpoint migration on GPUs. How did they do it?

They didn't use Docker checkpoint. They built custom application-level checkpointing:

SpotServe Architecture

GPU Instance
  Context Daemon (persistent process)
  • Model weights in GPU VRAM
  • KV cache in GPU VRAM
    ↓ CUDA IPC (shared GPU memory)
  Inference Engine (custom FasterTransformer fork)
  • Token-level checkpoint hooks
  • NCCL transfer to the target instance

What SpotServe Required

• A persistent context daemon that keeps the model weights and KV cache resident in GPU VRAM across engine restarts
• A custom inference engine (a modified FasterTransformer) with token-level checkpoint hooks
• CUDA IPC to share GPU memory between the daemon and the engine, plus NCCL to stream state to the target instance

Why This Didn't Work For Us

We use vLLM as a black box: it's an off-the-shelf, actively maintained inference engine. We don't have access to its internal KV cache structures, and forking it to add checkpoint hooks comes at a steep price:

Estimated effort: 2-6 months of dedicated engineering work.
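To make the scale of that work concrete, here is a rough sketch of the kind of hook an application-level checkpoint needs. Every class and method name below is hypothetical; nothing like this exists in vLLM's public API, which is precisely the problem (SpotServe built the equivalent inside a modified FasterTransformer in C++/CUDA):

from dataclasses import dataclass

@dataclass
class KVCacheSnapshot:
    request_id: str
    generated_token_ids: list[int]
    kv_blocks: bytes  # device memory copied to host -- layout is engine-internal

class CheckpointingEngine:
    """Hypothetical wrapper; every call into `engine` below would require a fork."""

    def __init__(self, engine):
        self.engine = engine

    def snapshot(self, request_id: str) -> KVCacheSnapshot:
        # Needs direct access to the paged KV cache and the scheduler's
        # per-request state -- internals a black-box engine does not expose.
        blocks = self.engine.kv_cache.copy_blocks_to_host(request_id)  # hypothetical
        tokens = self.engine.scheduler.generated_tokens(request_id)    # hypothetical
        return KVCacheSnapshot(request_id, tokens, blocks)

    def restore(self, snap: KVCacheSnapshot) -> None:
        # On the target instance: re-upload the KV blocks and resume decoding.
        self.engine.kv_cache.copy_blocks_from_host(snap.request_id, snap.kv_blocks)  # hypothetical
        self.engine.scheduler.resume(snap.request_id, snap.generated_token_ids)      # hypothetical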

Option 3: Stateless Failover (What We Chose)

If we can't preserve GPU state, what's the alternative? Don't try to.

HTTP is inherently stateless. If a request fails, the client retries. For inference:

Stateless Failover Flow

Spot preemption notice received (120-second warning):

1. Mark the instance as "draining" (load balancer stops sending it new requests)
2. Let in-flight requests complete (exploit the grace period)
3. Gracefully stop the container
4. Launch a replacement spot instance
5. Client retries hit the new instance with fresh state

Complete: service restored. (A minimal code sketch of this sequence follows.)
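The sketch assumes the preemption notice has already been received; mark_draining, active_request_count, and launch_replacement are hypothetical helpers wrapping your load balancer and cloud APIs, and the container is stopped through the Docker CLI:

import subprocess
import time

GRACE_PERIOD_S = 120  # spot preemption warning window

def handle_preemption(instance_id: str, container: str) -> None:
    deadline = time.monotonic() + GRACE_PERIOD_S

    mark_draining(instance_id)  # 1. load balancer stops routing new requests here (hypothetical helper)

    # 2. let in-flight requests finish, leaving ~20s of headroom for shutdown
    while active_request_count(container) > 0 and time.monotonic() < deadline - 20:
        time.sleep(1)  # hypothetical helper polls the engine's request metrics

    # 3. graceful stop so open connections close cleanly
    subprocess.run(["docker", "stop", "--time", "15", container], check=False)

    launch_replacement(instance_id)  # 4. hypothetical helper requests a new spot instance

    # 5. nothing else to do: client retries land on the replacement with fresh state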

Why This Works for Inference

Inference requests are self-contained: the full prompt travels with every HTTP request, so a retried request needs no server-side state from the instance that died. The only thing lost is partially generated output, and recomputing even a worst-case 32k-token request costs about a cent (see the ROI analysis below).
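A client-side retry sketch, assuming an OpenAI-compatible completions endpoint (which vLLM can serve); the URL and model name are placeholders:

import time
import requests

def generate_with_retry(prompt: str, retries: int = 5, backoff_s: float = 2.0) -> str:
    payload = {"model": "my-model", "prompt": prompt, "max_tokens": 256}
    for attempt in range(retries):
        try:
            resp = requests.post("http://inference.internal/v1/completions",
                                 json=payload, timeout=120)
            resp.raise_for_status()
            return resp.json()["choices"][0]["text"]
        except requests.RequestException:
            # Old instance was preempted: back off, then resend the whole prompt.
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError("inference endpoint unavailable after retries")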

The ROI Analysis: Why Stateless Wins

Even if we could build application-level checkpointing, would it be worth it?

Metric                                             | Value
Cost to rebuild 32k tokens (worst-case preemption) | $0.012 (1.2¢)
Cost to build a checkpoint system                  | $8,000 (2 weeks × $100/hr)
Break-even point                                   | 666,666 interruptions
At a once-per-day preemption rate                  | 1,826 years

Paying 1.2¢ to retry a request is trivially cheap next to $8,000 of checkpoint infrastructure. At one preemption per day it would take roughly 1,800 years to recoup the build cost; the ROI never closes.
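The break-even arithmetic from the table, spelled out:

retry_cost = 0.012        # $ to recompute a worst-case 32k-token request
build_cost = 8_000        # $ for ~2 weeks of engineering at $100/hr
interruptions_to_break_even = build_cost / retry_cost                 # 666,666.67 -> the ~666,666 in the table
years_at_one_preemption_per_day = interruptions_to_break_even / 365   # ≈ 1,826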

Performance Comparison

Approach           | Works on GPU? | Requires Custom Engine?    | Engineering Time   | Recovery Time
SpotServe-style    | ✓ Yes         | ✓ Yes (custom)             | 2-6 months         | ~100s (checkpoint transfer)
Docker checkpoint  | ✗ No          | ✗ No                       | N/A (doesn't work) | N/A
Stateless failover | ✓ Yes         | ✗ No (vLLM as a black box) | 1-2 weeks          | ~17s (spawn & load)

The Strategic Choice

We chose stateless failover not because we couldn't build SpotServe's system, but because using vLLM as a black box is strategically better: we pick up every improvement the vLLM project ships without carrying a fork, and the failover logic we do own stays small enough to reason about.

Conclusion

Engineering is about making trade-offs. The "coolest" technical solution (application-level checkpoint migration) is not always the right business solution.

Our stateless failover approach delivers the same 70% cost savings with 6x faster recovery time, 10x simpler codebase, and a fraction of the engineering investment.

In the end, we chose the solution that works reliably in production, not the one that looks most impressive on a whiteboard.

References

Miao, X., et al. "SpotServe: Serving Generative Large Language Models on Preemptible Instances." ASPLOS 2024.