
Why Stateless Failover: An Engineering Deep Dive

January 21, 2026 • by Bobby Mathews

TL;DR: We evaluated OS-level container checkpointing (CRIU) for warm migration on GPUs. We rejected it because CRIU simply cannot serialize GPU memory, and the application-level alternative costs far more to build than retrying failed HTTP requests ever will. The result: a stateless system that recovers 6x faster and is economically superior (our ROI analysis shows it would take 1,826 years to break even on checkpoint infrastructure).

The Problem: Spot Instances Need Fault Tolerance

Spot instances cost 70-80% less than on-demand, but they come with a catch: the cloud provider can reclaim them with just two minutes' notice. When that happens, your running inference workload has to move to a new instance.
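The two-minute warning itself arrives through the instance metadata service. Here is a minimal polling sketch, assuming an AWS-style spot interruption endpoint (other clouds expose an equivalent signal); the endpoint returns 404 until a reclaim is actually scheduled:

import time
import requests

# AWS-style spot interruption endpoint; 404 means no reclaim is scheduled yet.
INTERRUPTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_for_preemption_notice(poll_interval_s: float = 5.0) -> dict:
    """Block until the cloud schedules a reclaim, then return the notice payload."""
    while True:
        try:
            resp = requests.get(INTERRUPTION_URL, timeout=2)
            if resp.status_code == 200:
                return resp.json()  # e.g. {"action": "terminate", "time": "..."}
        except requests.RequestException:
            pass  # metadata service hiccup; keep polling
        time.sleep(poll_interval_s)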

The question is: how do we preserve state during that move?

Option 1: Container Checkpointing (What We Investigated)

Docker offers a built-in checkpoint feature:

docker checkpoint create my_container my_checkpoint
docker start --checkpoint my_checkpoint my_container

This seems perfect: save the entire container state to disk, transfer to new instance, restore and continue. No recomputation, no lost tokens.

There's just one problem: it doesn't work with GPUs.

Why Docker Checkpoint Fails on GPUs

Docker checkpoint relies on CRIU (Checkpoint/Restore In Userspace), which can only snapshot memory accessible to the Linux kernel.

Component     | Can CRIU Save It? | Why?
CPU registers | ✓ Yes             | Exposed by the kernel
System RAM    | ✓ Yes             | Ordinary process memory pages
GPU VRAM      | ✗ No              | Separate from system RAM, not accessible to CRIU
CUDA contexts | ✗ No              | NVIDIA driver state, not exposed to the kernel
TPU HBM       | ✗ No              | TPU matrix unit state is opaque to the OS

When an LLM inference engine runs, the model weights and KV cache reside in GPU memory. CRIU cannot see or save that memory—it's behind the NVIDIA/TPU driver, not exposed to the OS.

Attempting docker checkpoint create on a GPU container either fails immediately, hangs indefinitely, or creates a "checkpoint" that cannot be restored.

Option 2: Application-Level Checkpointing (What SpotServe Did)

The SpotServe research paper (Miao et al., ASPLOS'24) describes successful checkpoint migration on GPUs. How did they do it?

They didn't use Docker checkpoint. They built custom application-level checkpointing:

SpotServe Architecture

GPU Instance
  Context Daemon (persistent process)
  • Model weights in GPU VRAM
  • KV cache in GPU VRAM
    ↓ CUDA IPC (shared GPU memory)
  Inference Engine (custom FasterTransformer fork)
  • Token-level checkpoint hooks
  • NCCL transfer to the target instance

What SpotServe Required

• A persistent context daemon that keeps the model weights and KV cache resident in GPU VRAM across engine restarts
• A custom inference engine (a modified FasterTransformer) with token-level checkpoint hooks
• CUDA IPC to share GPU memory between the daemon and the engine, plus NCCL to stream state to the target instance

Why This Didn't Work For Us

We use vLLM as a black box: it's an off-the-shelf, actively maintained inference engine. We don't have access to its internal KV cache structures, and forking it to add checkpoint hooks comes at a steep price:

Estimated effort: 2-6 months of dedicated engineering work.
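To make the scale of that work concrete, here is a rough sketch of the kind of hook an application-level checkpoint needs. Every class and method name below is hypothetical; nothing like this exists in vLLM's public API, which is precisely the problem (SpotServe built the equivalent inside a modified FasterTransformer in C++/CUDA):

from dataclasses import dataclass

@dataclass
class KVCacheSnapshot:
    request_id: str
    generated_token_ids: list[int]
    kv_blocks: bytes  # device memory copied to host -- layout is engine-internal

class CheckpointingEngine:
    """Hypothetical wrapper; every call into `engine` below would require a fork."""

    def __init__(self, engine):
        self.engine = engine

    def snapshot(self, request_id: str) -> KVCacheSnapshot:
        # Needs direct access to the paged KV cache and the scheduler's
        # per-request state -- internals a black-box engine does not expose.
        blocks = self.engine.kv_cache.copy_blocks_to_host(request_id)  # hypothetical
        tokens = self.engine.scheduler.generated_tokens(request_id)    # hypothetical
        return KVCacheSnapshot(request_id, tokens, blocks)

    def restore(self, snap: KVCacheSnapshot) -> None:
        # On the target instance: re-upload the KV blocks and resume decoding.
        self.engine.kv_cache.copy_blocks_from_host(snap.request_id, snap.kv_blocks)  # hypothetical
        self.engine.scheduler.resume(snap.request_id, snap.generated_token_ids)      # hypothetical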

Option 3: Stateless Failover (What We Chose)

If we can't preserve GPU state, what's the alternative? Don't try to.

HTTP is inherently stateless. If a request fails, the client retries. For inference:

Stateless Failover Flow

Spot preemption notice received (120-second warning):

1. Mark the instance as "draining" (load balancer stops sending it new requests)
2. Let in-flight requests complete (exploit the grace period)
3. Gracefully stop the container
4. Launch a replacement spot instance
5. Client retries hit the new instance with fresh state

Complete: service restored. (A minimal code sketch of this sequence follows.)
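The sketch assumes the preemption notice has already been received; mark_draining, active_request_count, and launch_replacement are hypothetical helpers wrapping your load balancer and cloud APIs, and the container is stopped through the Docker CLI:

import subprocess
import time

GRACE_PERIOD_S = 120  # spot preemption warning window

def handle_preemption(instance_id: str, container: str) -> None:
    deadline = time.monotonic() + GRACE_PERIOD_S

    mark_draining(instance_id)  # 1. load balancer stops routing new requests here (hypothetical helper)

    # 2. let in-flight requests finish, leaving ~20s of headroom for shutdown
    while active_request_count(container) > 0 and time.monotonic() < deadline - 20:
        time.sleep(1)  # hypothetical helper polls the engine's request metrics

    # 3. graceful stop so open connections close cleanly
    subprocess.run(["docker", "stop", "--time", "15", container], check=False)

    launch_replacement(instance_id)  # 4. hypothetical helper requests a new spot instance

    # 5. nothing else to do: client retries land on the replacement with fresh state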

Why This Works for Inference

Inference requests are self-contained: the full prompt travels with every HTTP request, so a retried request needs no server-side state from the instance that died. The only thing lost is partially generated output, and recomputing even a worst-case 32k-token request costs about a cent (see the ROI analysis below).
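A client-side retry sketch, assuming an OpenAI-compatible completions endpoint (which vLLM can serve); the URL and model name are placeholders:

import time
import requests

def generate_with_retry(prompt: str, retries: int = 5, backoff_s: float = 2.0) -> str:
    payload = {"model": "my-model", "prompt": prompt, "max_tokens": 256}
    for attempt in range(retries):
        try:
            resp = requests.post("http://inference.internal/v1/completions",
                                 json=payload, timeout=120)
            resp.raise_for_status()
            return resp.json()["choices"][0]["text"]
        except requests.RequestException:
            # Old instance was preempted: back off, then resend the whole prompt.
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError("inference endpoint unavailable after retries")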

The ROI Analysis: Why Stateless Wins

Even if we could build application-level checkpointing, would it be worth it?

Metric                                             | Value
Cost to rebuild 32k tokens (worst-case preemption) | $0.012 (1.2¢)
Cost to build a checkpoint system                  | $8,000 (2 weeks × $100/hr)
Break-even point                                   | 666,666 interruptions
At a once-per-day preemption rate                  | 1,826 years

Paying 1.2¢ to retry a request is trivially cheap next to $8,000 of checkpoint infrastructure. At one preemption per day it would take roughly 1,800 years to recoup the build cost; the ROI never closes.
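The break-even arithmetic from the table, spelled out:

retry_cost = 0.012        # $ to recompute a worst-case 32k-token request
build_cost = 8_000        # $ for ~2 weeks of engineering at $100/hr
interruptions_to_break_even = build_cost / retry_cost                 # 666,666.67 -> the ~666,666 in the table
years_at_one_preemption_per_day = interruptions_to_break_even / 365   # ≈ 1,826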

Performance Comparison

Approach           | Works on GPU? | Requires Custom Engine?    | Engineering Time   | Recovery Time
SpotServe-style    | ✓ Yes         | ✓ Yes (custom)             | 2-6 months         | ~100s (checkpoint transfer)
Docker checkpoint  | ✗ No          | ✗ No                       | N/A (doesn't work) | N/A
Stateless failover | ✓ Yes         | ✗ No (vLLM as a black box) | 1-2 weeks          | ~17s (spawn & load)

The Strategic Choice

We chose stateless failover not because we couldn't build SpotServe's system, but because using vLLM as a black box is strategically better: we pick up every improvement the vLLM project ships without carrying a fork, and the failover logic we do own stays small enough to reason about.

Conclusion

Engineering is about making trade-offs. The "coolest" technical solution (application-level checkpoint migration) is not always the right business solution.

Our stateless failover approach delivers the same 70% cost savings with 6x faster recovery time, 10x simpler codebase, and a fraction of the engineering investment.

In the end, we chose the solution that works reliably in production, not the one that looks most impressive on a whiteboard.

References

Miao, X., et al. "SpotServe: Serving Generative Large Language Models on Preemptible Instances." ASPLOS 2024.