In 1965, Edsger Dijkstra introduced the Banker's Algorithm for deadlock avoidance in operating systems. The core insight was deceptively simple: before granting any resource request, simulate the allocation and verify the resulting state is safe.
A "safe state" means all processes can eventually complete. If granting a request would leave the system in an unsafe state—where deadlock becomes possible—the request is denied, and the process waits.
1. Process requests resources
2. Algorithm SIMULATES granting the request
3. Checks if resulting state is SAFE (safe = exists a sequence where all processes complete)
4. IF safe → Grant request
5. ELSE → Deny request, process waits

Invariant: Never enter a state you cannot safely exit.
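To ground the analogy that follows, here is a minimal Rust sketch of that safety check; it follows the textbook formulation with the usual matrix names, and is illustrative rather than Dijkstra's original code:

// Illustrative sketch of the textbook safety check (not Dijkstra's original code).
// Safe state: there is some order in which every process can run to completion.
fn is_safe(available: &[u32], need: &[Vec<u32>], allocation: &[Vec<u32>]) -> bool {
    let n = need.len();
    let mut work = available.to_vec(); // resources currently free
    let mut finished = vec![false; n];
    loop {
        // Find a process whose remaining need fits in what is free right now.
        let candidate = (0..n).find(|&p| {
            !finished[p] && need[p].iter().zip(&work).all(|(nd, w)| nd <= w)
        });
        match candidate {
            Some(p) => {
                // Assume p runs to completion and releases everything it holds.
                for (w, held) in work.iter_mut().zip(&allocation[p]) {
                    *w += held;
                }
                finished[p] = true;
            }
            // Safe if and only if every process could finish in some order.
            None => return finished.iter().all(|&f| f),
        }
    }
}

fn main() {
    // One resource type, 1 unit free; two processes each hold 1 and need 1 more.
    let available = vec![1];
    let allocation = vec![vec![1], vec![1]];
    let need = vec![vec![1], vec![1]];
    assert!(is_safe(&available, &need, &allocation)); // either process can go first
}

Granting a request means simulating the allocation and re-running this check on the resulting state; if it fails, the request is denied and the process waits.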
Sixty years later, I found myself applying the same principle to distributed systems running on volatile cloud infrastructure. The domains are different—processes and memory vs. containers and cloud APIs—but the invariant is identical.
Traditional distributed systems fail cryptically. A deployment times out. A health check fails. An instance terminates unexpectedly. The operator sees Error: deadline exceeded and begins the debugging ritual: check logs, check metrics, check configs, guess.
Observability-Driven Development inverts this. Before executing any operation, the system observes its complete state space and verifies the path ahead is clear. If preconditions aren't met, the system doesn't fail—it guides the operator to resolution.
User invokes: synkti --project-name my-app
1. System observes entire state space:
├── Where am I running? (EC2 instance or local machine?)
├── Does infrastructure exist? (S3 buckets, IAM roles, security groups)
├── Is the orchestrator binary in S3?
├── Are model weights uploaded?
└── Are peer instances discoverable?
2. Checks if state is SAFE (all dependencies present)
3. IF safe → Execute (start orchestrator or monitoring)
4. ELSE → Don't proceed. Guide user to safe state:
"Missing: orchestrator binary in S3
Run: ./scripts/upload-binary.sh --project-name my-app"
Invariant: Never enter a state you cannot safely complete.
The mapping to Dijkstra's algorithm is direct:
| Banker's Algorithm | Synkti (Distributed Systems) |
|---|---|
| Available resources matrix | S3 buckets, IAM roles, running instances |
| Maximum demand matrix | Required dependencies (binary, model weights) |
| Allocation matrix | Current state (infrastructure created? deps uploaded?) |
| Safe state check | is_safe_to_proceed() verification |
| Grant/Deny request | Execute operation / Guide user to fix |
The result: no blind failures. Every error message tells you exactly what's wrong and how to fix it. The system self-diagnoses before attempting operations that would fail.
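A minimal sketch of what that self-diagnosis can look like; the Precondition type and the single check in main are illustrative stand-ins, not Synkti's actual API:

// Illustrative types -- not Synkti's actual API.
// A precondition pairs an observation with the command that repairs it.
struct Precondition {
    description: &'static str,
    satisfied: bool,
    remediation: &'static str,
}

// Safe state check: either everything needed is present, or we return guidance.
fn is_safe_to_proceed(preconditions: &[Precondition]) -> Result<(), Vec<String>> {
    let missing: Vec<String> = preconditions
        .iter()
        .filter(|p| !p.satisfied)
        .map(|p| format!("Missing: {}\n  Run: {}", p.description, p.remediation))
        .collect();
    if missing.is_empty() { Ok(()) } else { Err(missing) }
}

fn main() {
    // In the real system these flags would come from observing S3, IAM, and EC2.
    let checks = [Precondition {
        description: "orchestrator binary in S3",
        satisfied: false,
        remediation: "./scripts/upload-binary.sh --project-name my-app",
    }];
    match is_safe_to_proceed(&checks) {
        Ok(()) => println!("Safe: executing"),
        Err(guidance) => guidance.iter().for_each(|g| println!("{g}")),
    }
}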
Kubernetes solved container orchestration with a centralized control plane: an API server, etcd for state, a scheduler, and various controllers. This architecture has a fundamental flaw: state drift.
The control plane maintains a model of the cluster in etcd—a representation of what it believes is true. But the map is not the territory. Reality exists at the nodes. The model is always an approximation, always slightly stale, always drifting from truth.
┌─────────────────────────────────────────────────────────────┐
│ Control Plane (API Server + etcd) │
│ "Desired state: 5 replicas, all healthy" │
└─────────────────────────────────────────────────────────────┘
│ reconciliation loop
▼
┌─────────────────────────────────────────────────────────────┐
│ Reality (Nodes) │
│ "Actual state: 3 running, 1 pending, 1 OOMKilled" │
└─────────────────────────────────────────────────────────────┘
The control plane's model diverges from reality.
Reconciliation is perpetually playing catch-up.
Network partitions turn drift into divergence.
This creates the reconciliation tax: continuous CPU cycles diffing desired vs. actual state, network bandwidth syncing state to the center, exponential complexity handling edge cases where the model has diverged from reality.
In P2P choreography, there is no central model to drift. Each node IS the authoritative source of its own state. When you need to know a node's status, you ask the node. No stale cache. No fiction.
┌───────────────────────────────────────────────────────────┐
│ Kubernetes (Centralized)                                  │
│                                                           │
│              ┌─────────────┐                              │
│              │   Control   │◄─── reconciliation loop ───► │
│              │    Plane    │     (continuous overhead)    │
│              └──────┬──────┘                              │
│                     │ commands                            │
│              ┌──────┴──────┬──────────┬──────────┐        │
│              ▼             ▼          ▼          ▼        │
│            Node A        Node B     Node C     Node D     │
│          (passive)     (passive)  (passive)  (passive)    │
│                                                           │
│  Single source of truth: Control Plane                    │
│  Failure mode: SPOF, state drift, split-brain             │
└───────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────┐
│ Synkti (P2P Choreography)                                 │
│                                                           │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐       │
│  │   Node A   │◄──►│   Node B   │◄──►│   Node C   │       │
│  │  (self-    │    │  (self-    │    │  (self-    │       │
│  │  governing)│    │  governing)│    │  governing)│       │
│  └────────────┘    └────────────┘    └────────────┘       │
│        │                 │                 │              │
│        └─────────────────┴─────────────────┘              │
│                          │                                │
│                EC2 Tags (discovery)                       │
│                SynktiCluster=my-app                       │
│                SynktiRole=worker                          │
│                                                           │
│  Source of truth: Each node (distributed)                 │
│  Failure mode: None (no central point of failure)         │
└───────────────────────────────────────────────────────────┘
Each Synkti node is self-governing: it observes its own context, discovers peers through EC2 tags (SynktiCluster, SynktiRole), and serves as the authoritative source of its own state.
No central coordinator means no single point of failure, no single point of control, and no state to drift. The "cluster state" is simply the set of tagged instances—always consistent with reality by construction.
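A minimal sketch of tag-based discovery; the Instance type is an in-memory stand-in so the example is self-contained, while the real system would read the same tags from the EC2 DescribeInstances API:

// Illustrative sketch: Instance is an in-memory stand-in for what the EC2
// DescribeInstances API returns, so the example is self-contained.
#[derive(Debug)]
struct Instance {
    id: String,
    tags: Vec<(String, String)>,
}

// The "cluster" is whatever currently carries the right tags -- nothing else.
fn discover_peers<'a>(instances: &'a [Instance], cluster: &str) -> Vec<&'a Instance> {
    instances
        .iter()
        .filter(|i| {
            i.tags.iter().any(|(k, v)| k == "SynktiCluster" && v == cluster)
                && i.tags.iter().any(|(k, v)| k == "SynktiRole" && v == "worker")
        })
        .collect()
}

fn main() {
    let instances = vec![Instance {
        id: "i-0123456789abcdef0".into(),
        tags: vec![
            ("SynktiCluster".into(), "my-app".into()),
            ("SynktiRole".into(), "worker".into()),
        ],
    }];
    // Asking "who is in the cluster?" is just re-running the query against reality.
    println!("{:?}", discover_peers(&instances, "my-app"));
}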
In C++, RAII (Resource Acquisition Is Initialization) ties resource lifetime to object scope. When an object goes out of scope, its destructor runs and resources are freed automatically. No manual cleanup. No resource leaks.
// C++ RAII
{
std::vector<int> v(1000); // Acquire memory
// Use v
} // Memory automatically freed (destructor runs)
The same principle applies to cloud infrastructure:
// Rust RAII for cloud infrastructure
{
let infra = Infrastructure::new("my-experiment").await?;
// Use infra (S3 buckets, IAM roles, instances)
} // Infrastructure automatically destroyed (Drop runs)
This inverts the traditional mental model. Infrastructure is not real estate you accumulate—it's a library dependency you borrow. When you're done, you return it.
| Resource Type | Lifetime | Management | Rationale |
|---|---|---|---|
| S3 orchestrator binary | Permanent | Manual | Intelligence—versioned |
| S3 model weights | Permanent | Manual | Intelligence—expensive to re-download |
| S3 checkpoints | Ephemeral | Automatic | Runtime state—auto-expires |
| IAM roles | Permanent | Manual (one-time) | Configuration—propagation delay |
| Instance profile | Permanent | Manual (one-time) | Configuration—tied to IAM |
| Security groups | Permanent | Manual (one-time) | Configuration—$0 cost |
| EC2 instances | Ephemeral | Automatic | Dumb compute—managed by AWS SDK |
The pattern: intelligence is permanent, compute is ephemeral. Model weights, orchestrator logic, IAM permissions—these are carefully managed assets that persist across deployments. EC2 instances are disposable workers, abstracted away except for the first- and second-order effects of the algorithms they execute.
The result: no resource leaks, no accumulating cloud bills from forgotten test infrastructure, and a clean slate for each deployment.
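A minimal sketch of the guard pattern, with a hypothetical EphemeralInfra type: only the ephemeral rows of the table above are owned by the guard, and its Drop is what returns them:

// Hypothetical guard type owning only the ephemeral half of the table above.
struct EphemeralInfra {
    project: String,
}

impl EphemeralInfra {
    fn provision(project: &str) -> Self {
        // In the real system: launch spot instances, create checkpoint prefixes.
        println!("provisioning ephemeral infra for {project}");
        EphemeralInfra { project: project.to_string() }
    }
}

impl Drop for EphemeralInfra {
    fn drop(&mut self) {
        // Runs when the guard leaves scope, including early returns and panics:
        // terminate instances, let checkpoints expire. The binary, model weights,
        // and IAM configuration are deliberately untouched.
        println!("tearing down ephemeral infra for {}", self.project);
    }
}

fn main() {
    {
        let _infra = EphemeralInfra::provision("my-experiment");
        // run the experiment...
    } // Drop runs here: the borrowed infrastructure is returned.
}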
Traditional operations treats servers as pets. You name them, configure them carefully, nurse them when sick. When one dies, it's a crisis requiring human intervention.
Fungible compute inverts this. Instances are disposable. Intelligence persists elsewhere.
┌───────────────────────────────────────────────────────────┐
│ PERMANENT INTELLIGENCE (S3) │
│ s3://my-project-models/ │
│ ├── bin/synkti ← Orchestrator logic │
│ └── qwen2.5-7b/ ← Model weights │
└───────────────────────────────────────────────────────────┘
│
│ Downloaded on boot
▼
┌───────────────────────────────────────────────────────────┐
│ FUNGIBLE COMPUTE │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ EC2 Spot │ │ EC2 Spot │ │ EC2 Spot │ │
│ │ Instance │ │ Instance │ │ Instance │ │
│ │ │ │ │ │ │ │
│ │ Disposable │ │ Disposable │ │ Disposable │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │
│ Any instance can be deleted/replaced at any time. │
│ The intelligence (orchestration logic, models) persists. │
└───────────────────────────────────────────────────────────┘
Each instance boots with a simple user-data script:
# 1. Download orchestrator from S3
aws s3 cp s3://${project}-models/bin/synkti /usr/local/bin/synkti
# 2. Download model weights
aws s3 sync s3://${project}-models/qwen2.5-7b/ /models/
# 3. Start orchestrator (which observes context and acts accordingly)
synkti --project-name ${project}
When a spot instance terminates, nothing of value is lost. A replacement boots, downloads the same intelligence from S3, discovers its peers, and joins the cluster. No configuration drift. No special recovery procedures. No operator intervention.
Stateful failover is operationally expensive. When a node fails, you must: detect the failure, find a healthy replacement, replicate state (potentially gigabytes of data), switch traffic, and hope nothing was lost in transit.
For GPU workloads, step 3 is often impossible. Docker checkpoint (CRIU) cannot serialize GPU memory. The KV cache for a 7B parameter model is ~14GB. Replicating that within a 2-minute spot termination notice is... optimistic.
Stateless failover sidesteps the problem entirely. Don't replicate state—spawn a fresh replacement.
Spot Termination Notice (120 seconds)
│
▼
┌─────────────────┐
│ 1. DRAIN (~5s) │ Stop accepting new requests
│ │ Let in-flight requests complete
└────────┬────────┘
│
▼
┌─────────────────┐
│ 2. SELECT (~1s) │ Query peers, pick replacement AZ
└────────┬────────┘
│
▼
┌─────────────────┐
│ 3. SPAWN (~35s) │ Launch new spot instance
│ │ Download binary + model from S3
└────────┬────────┘
│
▼
┌─────────────────┐
│ 4. HEALTH (~2s) │ Verify vLLM is ready
│ │ Register with load balancer
└────────┬────────┘
│
▼
Traffic restored
Total: ~45 seconds (well within 120s grace period)
No state replication. Fresh instance, fresh start.
This works because nothing irreplaceable lives on the instance: the orchestrator binary and model weights sit in S3, peers are discoverable through EC2 tags, and a fresh download is faster and simpler than replicating gigabytes of GPU state against a deadline.
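The same flow as a sketch in code, assuming a tokio runtime; every helper below is a stub standing in for real AWS SDK and vLLM calls, and the names are illustrative:

// Stubs standing in for the real calls: draining the local server, querying
// peers for a replacement AZ, launching a spot instance whose user-data pulls
// binary + weights from S3, polling vLLM health, updating the load balancer.
async fn drain_local_traffic() {}
async fn select_replacement_az() -> String { "us-east-1b".to_string() }
async fn spawn_replacement(az: &str) -> String { format!("i-new-in-{az}") }
async fn wait_until_healthy(_instance_id: &str) {}
async fn register_with_load_balancer(_instance_id: &str) {}

async fn handle_spot_termination() {
    drain_local_traffic().await;            // 1. DRAIN: finish in-flight requests
    let az = select_replacement_az().await; // 2. SELECT: ask peers, pick an AZ
    let id = spawn_replacement(&az).await;  // 3. SPAWN: fresh instance, fresh download
    wait_until_healthy(&id).await;          // 4. HEALTH: verify vLLM is ready,
    register_with_load_balancer(&id).await; //    then take traffic again
    println!("traffic restored via {id}");
}

#[tokio::main]
async fn main() {
    handle_spot_termination().await;
}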
Traditional tooling fragments responsibility across multiple binaries: kubectl for cluster management, kubelet for node agents, helm for package management, terraform for infrastructure. The operator must know which tool to use, where to run it, and in what sequence.
A responsible intelligence binary knows where it is and what it should do. It observes its context, infers its role, and acts appropriately—without configuration files telling it which mode to operate in.
$ synkti --project-name my-app

┌───────────────────────────────────────────────────────────┐
│ Binary observes: "Where am I?"                            │
│                                                           │
│  ┌────────────────────┐      ┌────────────────────┐       │
│  │   LOCAL MACHINE    │      │   EC2 INSTANCE     │       │
│  │                    │      │                    │       │
│  │  Role: Deployer    │      │  Role: Worker      │       │
│  │                    │      │                    │       │
│  │  • Manage infra    │      │  • Join cluster    │       │
│  │  • Run terraform   │      │  • Discover peers  │       │
│  │  • Launch workers  │      │  • Run orchestrator│       │
│  │  • Show dashboard  │      │  • Self-terminate  │       │
│  │                    │      │    on failure      │       │
│  └────────────────────┘      └────────────────────┘       │
│                                                           │
│  No configuration. No mode flags. The binary just knows.  │
└───────────────────────────────────────────────────────────┘
The EC2 worker doesn't need Terraform because infrastructure was already created by you running the binary locally. The worker's responsibilities are minimal and self-contained: join the cluster, discover peers, run the orchestrator, and self-terminate on failure.
The binary detects its context through layered observation—the same Banker's Algorithm principle, now applied at the application level:
async fn is_running_on_ec2() -> bool {
// Layer 1: IMDSv2 token endpoint
if check_imdsv2_token().await { return true; }
// Layer 2: Instance identity document
if check_instance_identity().await { return true; }
// Layer 3: System UUID pattern
if check_system_uuid() { return true; }
false
}
// The binary observes its state space before deciding its role
let on_ec2 = is_running_on_ec2().await;
if on_ec2 {
// Worker mode: minimal responsibilities, self-managing
run_orchestrator(project).await
} else {
// Deployer mode: infrastructure management available
deploy_and_monitor(project).await
}
This is responsible intelligence in two senses: the binary is responsible for knowing its own role, and it behaves responsibly within that role—workers don't try to manage infrastructure they shouldn't touch, deployers don't try to join clusters they're not part of.
The design is also extensible. The same binary can become a CLI tool, a worker, a monitoring agent, or anything else—determined by context observation, not hardcoded roles. Future capabilities are added by extending the observation logic, not by creating new binaries.
These patterns share an underlying philosophy: observe reality before acting, keep state where it actually lives, and treat intelligence as permanent while compute stays disposable. The result is a system that diagnoses itself before every operation, survives spot terminations without operator intervention, and never drifts from the reality it runs on.
"The system observes itself before acting. It never fails blindly—it knows what it needs and tells you exactly what's missing."
This is what Dijkstra understood in 1965, applied to the cloud era: never enter a state you cannot safely exit. The Banker's Algorithm scales from processes to distributed systems. The invariant holds.