I killed a worker mid-payment to test “exactly-once” execution - CodeGurus

Distributed systems often claim “exactly-once” execution. In practice, this is usually implemented as at-least-once delivery + retries + idempotency keys.
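To make that pattern concrete, here is a minimal sketch of at-least-once delivery combined with an idempotency key. The `PaymentProvider` class, its `charge` method, and the key `"task-42"` are hypothetical stand-ins rather than any real provider's API. The point is only that redelivery is harmless when the receiving side deduplicates by key.

```python
class PaymentProvider:
    """Hypothetical in-memory provider that deduplicates by idempotency key."""

    def __init__(self):
        self._by_key = {}

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the stored receipt instead of charging again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        receipt = {"id": len(self._by_key) + 1, "amount_cents": amount_cents}
        self._by_key[idempotency_key] = receipt
        return receipt


def deliver_at_least_once(provider, key, amount_cents, attempts=3):
    """Simulate redelivery: the same request is sent several times."""
    receipt = None
    for _ in range(attempts):
        receipt = provider.charge(key, amount_cents)
    return receipt


provider = PaymentProvider()
receipt = deliver_at_least_once(provider, "task-42", 9999)
charges_recorded = len(provider._by_key)  # one charge despite three deliveries
```

The guarantee lives entirely in the receiver. If the side effect has no key to deduplicate on, the retries are real duplicates.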

This works when the task code is deterministic and idempotent. It breaks when a step has an irreversible side effect that cannot be safely replayed (AI agent actions, LLM calls, changes to physical infrastructure).

I wanted to see what actually happens if a worker crashes after a payment is made but before it acknowledges completion. So I built a minimal execution kernel with one rule: User code is never replayed by the infrastructure.

The kernel uses:

Leases (Fencing Tokens / Epochs)
A reconciler that recovers crashed tasks
Strict state transitions (No silent retries)
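The three pieces above can be sketched roughly as follows. This is not the kernel's actual code: the state names (`PENDING`, `RUNNING`, `ORPHANED`, `DONE`, `FAILED`), the 30-second lease, and the function names are all assumptions made for illustration, and the clock is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass

# Allowed transitions; anything else is rejected rather than silently retried.
TRANSITIONS = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"DONE", "FAILED", "ORPHANED"},
    "ORPHANED": {"RUNNING"},
    "DONE": set(),
    "FAILED": set(),
}


@dataclass
class Task:
    state: str = "PENDING"
    epoch: int = 0                 # fencing token, bumped on every claim
    lease_expires_at: float = 0.0


def _move(task, new_state):
    if new_state not in TRANSITIONS[task.state]:
        raise ValueError(f"illegal transition {task.state} -> {new_state}")
    task.state = new_state


def claim(task, now, lease_seconds=30.0):
    """Take the lease and return a fresh fencing token."""
    _move(task, "RUNNING")
    task.epoch += 1
    task.lease_expires_at = now + lease_seconds
    return task.epoch


def finish(task, token, new_state):
    """Accept a result only from the holder of the current token."""
    if token != task.epoch or task.state != "RUNNING":
        return False               # stale or expired worker: write is fenced off
    _move(task, new_state)
    return True


def reconcile(task, now):
    """Mark a task whose lease ran out as orphaned; no user code runs here."""
    if task.state == "RUNNING" and now >= task.lease_expires_at:
        _move(task, "ORPHANED")


task = Task()
old_token = claim(task, now=0.0)                    # worker A claims, then crashes
reconcile(task, now=31.0)                           # the 30 s lease has expired
new_token = claim(task, now=31.0)                   # worker B claims, new epoch
stale_accepted = finish(task, old_token, "DONE")    # late write from A
fresh_accepted = finish(task, new_token, "FAILED")  # B reports its outcome
```

The reconciler only changes task state. Whether the work is attempted again is decided by whoever claims the task next, not by the infrastructure.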

I ran this experiment:

A worker claims a task to process a $99.99 payment
The worker records the payment (irreversible side effect)
I kill the worker with kill -9 before it reports completion to the DB
The lease expires, the reconciler detects the zombie task
A new worker claims the task with a new fencing token
The new worker sees the previous attempt in the ledger (via app logic) and aborts
The task fails safely
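The sequence above can be replayed in a few lines. This is a simulation under stated assumptions, not the original experiment: `WorkerKilled` stands in for `kill -9`, the ledger is a plain list, the reconciler is reduced to a comment, and the task fields and function names are hypothetical. In a real system the ledger check and the ledger write would also need to be atomic.

```python
class WorkerKilled(Exception):
    """Stands in for kill -9: raised after the side effect, before the ack."""


def claim(task):
    """Bump the epoch (fencing token) and mark the task as running."""
    task["epoch"] += 1
    task["state"] = "RUNNING"
    return task["epoch"]


def process_payment(task, ledger, token, crash_before_ack=False):
    # App-level check: did an earlier attempt already record this payment?
    if any(entry["task_id"] == task["id"] for entry in ledger):
        task["state"] = "FAILED"   # abort instead of paying a second time
        return "aborted"
    ledger.append(
        {"task_id": task["id"], "amount_cents": task["amount_cents"], "epoch": token}
    )
    if crash_before_ack:
        raise WorkerKilled()       # side effect happened, completion never sent
    task["state"] = "DONE"
    return "completed"


task = {"id": "pay-1", "amount_cents": 9999, "state": "PENDING", "epoch": 0}
ledger = []

token_a = claim(task)
try:
    process_payment(task, ledger, token_a, crash_before_ack=True)
except WorkerKilled:
    pass  # the task is left RUNNING with no completion recorded

# The reconciler (not shown) detects the expired lease; worker B then claims.
token_b = claim(task)
outcome = process_payment(task, ledger, token_b)
```

After the run, the ledger holds a single entry written under the first epoch, and the second attempt ends in `FAILED` rather than in a second charge.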

Result: Exactly one payment was recorded. The charge was not duplicated.

Workflow engines and task queues such as Temporal, Airflow, and Celery typically recover from a crash by re-running the task logic, whether by default or once retries are configured. This assumes your code is idempotent.