I killed a worker mid-payment to test “exactly-once” execution

Distributed systems often claim “exactly-once” execution. In practice, this is usually implemented as at-least-once delivery + retries + idempotency keys.
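To make that pattern concrete, here is a minimal sketch of at-least-once delivery combined with an idempotency key. Everything in it (`PaymentService`, `deliver_at_least_once`, the key `"task-42"`) is a hypothetical name invented for illustration, not part of any real payment API.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Toy in-memory payment service that deduplicates by idempotency key."""
    charges: dict = field(default_factory=dict)  # idempotency_key -> amount in cents

    def charge(self, idempotency_key: str, amount_cents: int) -> bool:
        """Return True if a new charge was made, False if the key was already seen."""
        if idempotency_key in self.charges:
            return False
        self.charges[idempotency_key] = amount_cents
        return True


def deliver_at_least_once(service, key, amount_cents, attempts):
    """Simulate redelivery: the same message reaches the handler `attempts` times."""
    return [service.charge(key, amount_cents) for _ in range(attempts)]


service = PaymentService()
results = deliver_at_least_once(service, "task-42", 9999, attempts=3)
print(results, len(service.charges))
```

The message is delivered three times, but only the first delivery changes state. The guarantee comes from the receiver honoring the key, not from the delivery layer.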

This works when the task logic is deterministic and idempotent. It breaks when the side effect is irreversible (AI agent actions, LLM calls, physical infrastructure).

I wanted to see what actually happens if a worker crashes after a payment is made but before it acknowledges completion. So I built a minimal execution kernel with one rule: user code is never replayed by the infrastructure.

The kernel uses:

  1. Leases (Fencing Tokens / Epochs)
  2. A reconciler that recovers crashed tasks
  3. Strict state transitions (No silent retries)
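As a rough sketch of how those three pieces could fit together (not the actual kernel code; `TaskStore`, `State`, and `StaleLeaseError` are hypothetical names, and resetting an expired task to pending is an assumption about what the reconciler does), consider an in-memory store where every claim bumps an epoch and every write must present the current epoch:

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class StaleLeaseError(Exception):
    """Raised when a worker acts with an epoch that is no longer current."""


@dataclass
class Task:
    task_id: str
    state: State = State.PENDING
    epoch: int = 0                 # fencing token, bumped on every claim
    lease_expires_at: float = 0.0


class TaskStore:
    def __init__(self):
        self.tasks = {}

    def add(self, task_id):
        self.tasks[task_id] = Task(task_id)

    def claim(self, task_id, now, lease_seconds):
        """PENDING -> RUNNING. Returns the new fencing token."""
        task = self.tasks[task_id]
        if task.state is not State.PENDING:
            raise ValueError(f"cannot claim task in state {task.state}")
        task.epoch += 1
        task.state = State.RUNNING
        task.lease_expires_at = now + lease_seconds
        return task.epoch

    def _finish(self, task_id, epoch, final_state):
        """RUNNING -> final state, only for the holder of the current epoch."""
        task = self.tasks[task_id]
        if task.state is not State.RUNNING or task.epoch != epoch:
            raise StaleLeaseError(f"epoch {epoch} is stale for {task_id}")
        task.state = final_state

    def complete(self, task_id, epoch):
        self._finish(task_id, epoch, State.COMPLETED)

    def fail(self, task_id, epoch):
        self._finish(task_id, epoch, State.FAILED)

    def reconcile(self, now):
        """Make expired RUNNING tasks claimable again. Never runs user code."""
        recovered = []
        for task in self.tasks.values():
            if task.state is State.RUNNING and task.lease_expires_at <= now:
                task.state = State.PENDING
                recovered.append(task.task_id)
        return recovered


store = TaskStore()
store.add("pay-1")
old_epoch = store.claim("pay-1", now=0.0, lease_seconds=30.0)
recovered = store.reconcile(now=31.0)           # lease has expired
new_epoch = store.claim("pay-1", now=31.0, lease_seconds=30.0)

try:
    store.complete("pay-1", old_epoch)          # zombie worker wakes up late
    zombie_rejected = False
except StaleLeaseError:
    zombie_rejected = True
```

Time is passed in explicitly so the example is deterministic. A real store would do the epoch comparison atomically, for example in a conditional database update.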

I ran this experiment:

  1. A worker claims a task to process a $99.99 payment
  2. The worker records the payment (irreversible side effect)
  3. I kill -9 the worker before it sends completion to the DB
  4. The lease expires and the reconciler detects the zombie task
  5. A new worker claims the task with a new fencing token
  6. The new worker sees the previous attempt in the ledger (via app logic) and aborts
  7. The task fails safely
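The steps above can be simulated in a few lines. This is a simplified stand-in for the chaos demo, not the demo itself: the `ledger` list, the `task` dict, and the function names are assumptions made for illustration, the crash is modeled as an early return instead of a real `kill -9`, and lease expiry is assumed to have already happened when `reconcile` runs.

```python
ledger = []  # append-only record of side effects: (task_id, epoch, amount_cents)
task = {"id": "pay-1", "state": "pending", "epoch": 0}


def claim(task):
    """Hand the task to a worker with a fresh fencing token."""
    task["epoch"] += 1
    task["state"] = "running"
    return task["epoch"]


def run_payment(task, epoch, amount_cents, crash_before_ack=False):
    """App-level logic: check the ledger first, then record the payment."""
    if any(entry[0] == task["id"] for entry in ledger):
        # A previous attempt already touched the ledger: abort, do not pay again.
        task["state"] = "failed"
        return "aborted"
    ledger.append((task["id"], epoch, amount_cents))
    if crash_before_ack:
        return "crashed"  # stands in for kill -9: state stays "running"
    task["state"] = "completed"
    return "completed"


def reconcile(task):
    """Assume the lease has expired: make the zombie task claimable again."""
    if task["state"] == "running":
        task["state"] = "pending"


first_epoch = claim(task)
first = run_payment(task, first_epoch, 9999, crash_before_ack=True)
reconcile(task)
second_epoch = claim(task)
second = run_payment(task, second_epoch, 9999)
print(first, second, len(ledger), task["state"])
```

The infrastructure only moves the task between states. The decision not to pay again is made by the application when it reads the ledger.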

Result: Exactly one payment was recorded. The money did not duplicate.

Workflow engines such as Temporal, Airflow, and Celery are commonly set up to retry the task logic after a crash (in some cases by default, in others by configuration). This assumes your code is idempotent.

  • AI agents are not.
  • LLM generation is not.
  • Payment APIs (without idempotency keys) are not.
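For contrast, here is a small sketch of the failure mode when non-idempotent logic is retried blindly. The names (`charge_card`, `run_with_retries`, `WorkerCrashed`) are hypothetical and the crash is modeled as an exception raised after the side effect.

```python
charges = []  # every call to the (non-idempotent) payment API lands here


class WorkerCrashed(Exception):
    """Stands in for a worker dying after the side effect but before the ack."""


def charge_card(amount_cents):
    # No idempotency key: every call is a new charge.
    charges.append(amount_cents)


def task_logic(crash_after_charge):
    charge_card(9999)
    if crash_after_charge:
        raise WorkerCrashed("died before acknowledging completion")


def run_with_retries(max_attempts):
    """Naive engine behavior: on crash, run the same task logic again."""
    for attempt in range(1, max_attempts + 1):
        try:
            task_logic(crash_after_charge=(attempt == 1))
            return attempt
        except WorkerCrashed:
            continue
    raise RuntimeError("all attempts failed")


attempts_used = run_with_retries(max_attempts=3)
print(attempts_used, charges)
```

The task reports success on the second attempt, and the card has been charged twice.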

I open-sourced the kernel and the chaos demo here. The point isn’t adoption. The point is to make replay unsafe again.

https://github.com/abokhalill/pulse

submitted by /u/AdministrativeAsk305