Spot instances cost 70-80% less than on-demand, but they come with a catch: the cloud provider can reclaim them with as little as two minutes' notice. When that happens, your running inference workload needs to move to a new instance.
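On AWS, that two-minute warning arrives as an "instance-action" notice in the instance metadata service (other providers expose similar signals). A minimal watcher might look like the sketch below; the function name and polling interval are illustrative, not part of any SDK:

```python
import time
import requests

# Sketch of a spot-interruption watcher for AWS. The instance-action endpoint
# returns 404 until a reclaim is scheduled, then a JSON body such as
# {"action": "terminate", "time": "..."}. Instances enforcing IMDSv2 also
# need a session token header, omitted here for brevity.
INSTANCE_ACTION = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_for_interruption(poll_seconds: float = 5.0) -> dict:
    """Block until the provider schedules a reclaim, then return the notice."""
    while True:
        try:
            resp = requests.get(INSTANCE_ACTION, timeout=2)
            if resp.status_code == 200:
                return resp.json()  # ~2 minutes left: start draining now
        except requests.RequestException:
            pass  # metadata service hiccup; keep polling
        time.sleep(poll_seconds)
```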
The question is: how do we preserve state during that move?
Docker offers a built-in (experimental) checkpoint feature:

```bash
docker checkpoint create my_container my_checkpoint
docker start --checkpoint my_checkpoint my_container
```
This seems perfect: save the entire container state to disk, transfer it to the new instance, restore, and continue. No recomputation, no lost tokens.
There's just one problem: it doesn't work with GPUs.
Docker checkpoint relies on CRIU (Checkpoint/Restore In Userspace), which can only snapshot memory accessible to the Linux kernel.
| Component | Can CRIU Save It? | Why? |
|---|---|---|
| CPU registers | ✓ Yes | Exposed by kernel |
| System RAM | ✓ Yes | Process memory pages |
| GPU VRAM | ✗ No | Separate from system RAM, not accessible to CRIU |
| CUDA contexts | ✗ No | NVIDIA driver state, not exposed to kernel |
| TPU HBM | ✗ No | Behind the TPU runtime; opaque to the kernel |
When an LLM inference engine runs, the model weights and KV cache reside in GPU memory. CRIU cannot see or save that memory—it's behind the NVIDIA/TPU driver, not exposed to the OS.
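To make that concrete, here's a small probe (assuming the `nvidia-ml-py` bindings) showing how much state lives in VRAM. Everything it reports is memory CRIU can never reach:

```python
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

# Query GPU 0 through NVML. For a 7B model in fp16 plus its KV cache this
# typically reports tens of GiB in use -- none of which appears in
# /proc/<pid>/maps, so no userspace checkpointer can serialize it.
nvmlInit()
mem = nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
print(f"VRAM used: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
nvmlShutdown()
```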
Attempting docker checkpoint create on a GPU container either fails immediately, hangs indefinitely, or creates a "checkpoint" that cannot be restored.
The SpotServe research paper (Miao et al., ASPLOS'24) describes successful checkpoint migration on GPUs. How did they do it?
They didn't use Docker checkpoint. They built application-level checkpointing into their own inference engine: the engine itself serializes the KV cache and in-flight request state, then migrates it to the replacement instance.
We use vLLM as a black box: it's an off-the-shelf, actively maintained inference engine. We don't have access to its internal KV cache structures, and forking it to add checkpoint hooks would mean maintaining a permanent fork and re-porting those hooks against every upstream release. Estimated effort: 2-6 months of dedicated engineering work.
If we can't preserve GPU state, what's the alternative? Don't try to.
HTTP is inherently stateless: if a request fails, the client retries. For inference, that means the replacement instance simply recomputes the response from the original prompt; the only thing lost is the work of re-running the prefill.
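Concretely, the client side is just a retry loop. A sketch against vLLM's OpenAI-compatible completions endpoint (the URL, model name, and backoff policy here are placeholders, not our production values):

```python
import time
import requests

INFERENCE_URL = "http://inference.internal/v1/completions"  # behind the LB

def complete(prompt: str, max_retries: int = 5) -> str:
    """Send a completion request; on failure, back off and resend the prompt."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                INFERENCE_URL,
                json={"model": "my-model", "prompt": prompt, "max_tokens": 256},
                timeout=120,
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["text"]
        except requests.RequestException:
            # The backing instance may have been preempted mid-request:
            # wait out the respawn window, then recompute from scratch.
            time.sleep(min(2 ** attempt, 30))
    raise RuntimeError("inference backend unavailable after retries")
```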
Even if we could build application-level checkpointing, is it worth it?
| Metric | Value |
|---|---|
| Cost to rebuild 32k tokens (worst-case preemption) | $0.012 (1.2¢) |
| Cost to build checkpoint system | $8,000 (2 weeks ≈ 80 hrs × $100/hr) |
| Break-even point | 666,666 interruptions |
| At once/day preemption rate | 1,826 years |
Paying 1.2¢ to retry a request is orders of magnitude cheaper than spending $8,000 to build checkpoint infrastructure. The ROI never closes at any realistic preemption rate.
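The arithmetic behind those rows is short enough to inline:

```python
rebuild_cost = 0.012          # $ to recompute 32k tokens after a preemption
build_cost = 80 * 100         # 2 weeks ~= 80 hrs at $100/hr = $8,000

break_even = build_cost / rebuild_cost   # ~666,666 interruptions
years = break_even / 365                 # ~1,826 years at one preemption/day
print(f"break-even: {break_even:,.0f} interruptions, {years:,.0f} years")
```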
| Approach | Works on GPU? | Requires Custom Engine? | Engineering Time | Recovery Time |
|---|---|---|---|---|
| SpotServe-style | ✓ Yes | ✓ Yes (custom) | 2-6 months | ~100s (checkpoint transfer) |
| Docker checkpoint | ✗ No | ✗ No | N/A (doesn't work) | N/A |
| Stateless failover | ✓ Yes | ✗ No (vLLM black box) | 1-2 weeks | ~17s (spawn & load) |
We chose stateless failover not because we couldn't build SpotServe's system, but because using vLLM as a black box is strategically better: we inherit every upstream improvement for free, and our recovery path stays a small amount of orchestration code rather than a fork we have to maintain.
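The entire recovery path reduces to "start vLLM somewhere else and wait for it to come up." A local sketch follows; in production the spawn happens on a freshly provisioned spot instance via our orchestrator, and the model name and port are illustrative:

```python
import subprocess
import time
import requests

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # illustrative model name

# Launch vLLM's stock OpenAI-compatible server -- no fork, no custom hooks.
server = subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", MODEL, "--port", "8000",
])

# Poll the server's /health endpoint until the weights are loaded; this is
# the "spawn & load" window from the table above. Once healthy, re-register
# the backend with the load balancer and resume serving.
while True:
    try:
        if requests.get("http://localhost:8000/health", timeout=1).ok:
            break
    except requests.RequestException:
        pass
    time.sleep(1)
print("replacement up; re-register with load balancer")
```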
Engineering is about making trade-offs. The "coolest" technical solution (application-level checkpoint migration) is not always the right business solution.
Our stateless failover approach delivers the same 70% cost savings with 6x faster recovery time, 10x simpler codebase, and a fraction of the engineering investment.
In the end, we chose the solution that works reliably in production, not the one that looks most impressive on a whiteboard.