I killed a worker mid-payment to test “exactly-once” execution
Distributed systems often claim “exactly-once” execution. In practice, this is usually implemented as at-least-once delivery + retries + idempotency keys.
This works when the task code is deterministic and idempotent. It breaks when the task has irreversible side effects (AI agent actions, LLM calls, changes to physical infrastructure).
I wanted to see what actually happens if a worker crashes after a payment is made but before it acknowledges completion. So I built a minimal execution kernel with one rule: User code is never replayed by the infrastructure.
The kernel uses:
- Leases (Fencing Tokens / Epochs)
- A reconciler that recovers crashed tasks
- Strict state transitions (No silent retries)
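To make the lease and fencing-token idea concrete, here is a minimal in-memory sketch. It is not the kernel's actual code: `TaskRow`, `claim`, `complete`, and the state names are hypothetical, and re-claiming an expired lease is folded into `claim` here rather than handled by a separate reconciler.

```python
from dataclasses import dataclass


class StaleTokenError(Exception):
    """Raised when a write carries an outdated fencing token."""


@dataclass
class TaskRow:
    # Hypothetical states: PENDING -> RUNNING -> COMPLETED
    state: str = "PENDING"
    epoch: int = 0                 # fencing token, bumped on every claim
    lease_expires_at: float = 0.0


def claim(row: TaskRow, now: float, lease_seconds: float = 30.0) -> int:
    """Claim a pending task, or one whose lease has expired; return the new token."""
    lease_expired = row.state == "RUNNING" and now >= row.lease_expires_at
    if row.state != "PENDING" and not lease_expired:
        raise RuntimeError(f"task not claimable in state {row.state}")
    row.state = "RUNNING"
    row.epoch += 1
    row.lease_expires_at = now + lease_seconds
    return row.epoch


def complete(row: TaskRow, token: int) -> None:
    """Accept completion only from the current lease holder."""
    if token != row.epoch:
        raise StaleTokenError(f"token {token} is stale, current epoch is {row.epoch}")
    if row.state != "RUNNING":
        raise RuntimeError(f"illegal transition from {row.state} to COMPLETED")
    row.state = "COMPLETED"


row = TaskRow()
old_token = claim(row, now=0.0)      # worker A claims, then stalls
new_token = claim(row, now=31.0)     # lease expired, worker B claims
try:
    complete(row, old_token)         # worker A wakes up late
    stale_write_accepted = True
except StaleTokenError:
    stale_write_accepted = False
complete(row, new_token)             # only the current holder can finish
```

The point of the token is that a worker that was presumed dead cannot overwrite state after someone else has taken over, even if it was only slow.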
I ran this experiment:
- A worker claims a task to process a $99.99 payment
- The worker records the payment (irreversible side effect)
- I `kill -9` the worker before it sends completion to the DB
- The lease expires, the reconciler detects the zombie task
- A new worker claims the task with a new fencing token
- The new worker sees the previous attempt in the ledger (via app logic) and aborts
- The task fails safely
Result: Exactly one payment was recorded. The charge was not duplicated.
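The experiment can be approximated in a few lines of Python. This is a simulation under stated assumptions, not the open-sourced demo: `Task`, `Ledger`, `claim`, `reconcile`, `finish`, and `pay` are hypothetical names, the crash is simulated by returning before the acknowledgement, and the reconciler is assumed to make the task claimable again without ever running user code.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    amount_cents: int
    state: str = "PENDING"
    epoch: int = 0                 # fencing token
    lease_expires_at: float = 0.0


@dataclass
class Ledger:
    """Stands in for the durable record of the irreversible side effect."""
    entries: list = field(default_factory=list)

    def record(self, task_id: str, amount_cents: int) -> None:
        self.entries.append((task_id, amount_cents))

    def has_payment(self, task_id: str) -> bool:
        return any(tid == task_id for tid, _ in self.entries)


def claim(task: Task, now: float, lease_seconds: float = 30.0) -> int:
    if task.state != "PENDING":
        raise RuntimeError(f"cannot claim task in state {task.state}")
    task.state = "RUNNING"
    task.epoch += 1
    task.lease_expires_at = now + lease_seconds
    return task.epoch


def reconcile(task: Task, now: float) -> bool:
    """Release a RUNNING task whose lease expired. Never calls user code."""
    if task.state == "RUNNING" and now >= task.lease_expires_at:
        task.state = "PENDING"
        return True
    return False


def finish(task: Task, token: int, new_state: str) -> None:
    if token != task.epoch or task.state != "RUNNING":
        raise RuntimeError("stale token or illegal transition")
    task.state = new_state


def pay(task: Task, token: int, ledger: Ledger, crash_before_ack: bool = False) -> str:
    """Application logic: the ledger check lives here, not in the kernel."""
    if ledger.has_payment(task.task_id):
        finish(task, token, "FAILED")   # a previous attempt already paid: abort
        return "aborted"
    ledger.record(task.task_id, task.amount_cents)
    if crash_before_ack:
        return "crashed"                # simulated kill -9: task stays RUNNING
    finish(task, token, "COMPLETED")
    return "completed"


task, ledger = Task("pay-1", 9999), Ledger()
first = pay(task, claim(task, now=0.0), ledger, crash_before_ack=True)
recovered = reconcile(task, now=31.0)   # lease expired, zombie detected
second_token = claim(task, now=31.0)    # new worker, new fencing token
second = pay(task, second_token, ledger)
```

The second attempt ends in a failed state rather than a second charge, which is the trade the design makes: a task that needs human or application-level follow-up instead of a silent replay.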
Workflow engines and task queues such as Temporal, Airflow, and Celery either re-run the task logic after a crash by default or can be configured to do so. Re-running assumes your code is idempotent.
- AI agents are not.
- LLM generation is not.
- Payment APIs (without keys) are not.
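For contrast, this is roughly what an idempotency key buys you when the downstream API supports one. `FakePaymentAPI` is a made-up stand-in, not any real provider's client; it only shows why a retried call is harmless with a key and a duplicate charge without one.

```python
from typing import Optional


class FakePaymentAPI:
    """Hypothetical payment provider that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charges = []      # list of (charge_id, amount_cents)
        self._by_key = {}      # idempotency key -> charge_id

    def charge(self, amount_cents: int, idempotency_key: Optional[str] = None) -> str:
        # A repeated key returns the original charge instead of creating a new one.
        if idempotency_key is not None and idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_cents))
        if idempotency_key is not None:
            self._by_key[idempotency_key] = charge_id
        return charge_id


# Retried without a key: the customer is charged twice.
no_key_api = FakePaymentAPI()
no_key_api.charge(9999)
no_key_api.charge(9999)

# Retried with the same key: the second call is a no-op.
keyed_api = FakePaymentAPI()
first_id = keyed_api.charge(9999, idempotency_key="task-pay-1")
retry_id = keyed_api.charge(9999, idempotency_key="task-pay-1")
```

An LLM generation or an agent action has no equivalent of that key, which is why replaying them is a different kind of problem.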
I open-sourced the kernel and the chaos demo here. The point isn’t adoption. The point is to make replay unsafe again.
submitted by /u/AdministrativeAsk305