Replay-Driven Development

A new paradigm for building and debugging agent systems using deterministic execution replays.

Debugging the Non-Deterministic

Debugging traditional agent systems is an exercise in frustration. You observe a failure, but when you re-run the agent with the same inputs, it takes a different path and succeeds. Or it fails in a different way. The non-determinism that makes LLMs creative also makes them nearly impossible to debug systematically.

Traditional software engineering has well-established debugging workflows: set a breakpoint, reproduce the bug, step through the code, fix it, write a regression test. Each of these steps assumes determinism. When your “code” is an LLM that can produce a different output on every run, a breakpoint may never be hit again, reproduction is unreliable, and regression tests can only make probabilistic claims.

Enter Replay-Driven Development

Replay-Driven Development (RDD) is a paradigm that addresses this problem by making agent executions reproducible. The core idea is simple: every agent execution is recorded as a deterministic execution graph, and any recorded execution can be replayed identically, modified, and re-tested.
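To make the record-then-replay idea concrete, here is a minimal Python sketch. It is not how any particular tool implements it: `Trace`, `run_and_record`, `replay`, and `flaky_decider` are hypothetical names, and the random chooser only stands in for an LLM call. The point it illustrates is that replay reads decisions back from the recording instead of asking the model again.

```python
from dataclasses import dataclass, field
from typing import Callable, List
import random

ACTIONS = ["query_db", "call_api", "write_file"]


@dataclass
class Trace:
    """Recorded decisions from one execution (hypothetical structure)."""
    inputs: dict
    decisions: List[str] = field(default_factory=list)


def run_and_record(inputs: dict, decide: Callable[[dict, int], str], steps: int) -> Trace:
    """Live run: ask the non-deterministic decider and record every answer."""
    trace = Trace(inputs=dict(inputs))
    for i in range(steps):
        trace.decisions.append(decide(inputs, i))
    return trace


def replay(trace: Trace) -> List[str]:
    """Replay: return the recorded decisions; the decider is never called."""
    return list(trace.decisions)


def flaky_decider(inputs: dict, step: int) -> str:
    # Stand-in for an LLM call: picks an action at random on every invocation.
    return random.choice(ACTIONS)


trace = run_and_record({"task": "sync"}, flaky_decider, steps=5)
first = replay(trace)
second = replay(trace)  # identical to `first`, however random the live run was
```

Running `run_and_record` twice may give two different traces, but replaying one trace always yields the same sequence.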

The workflow looks like this:

1. Capture — The agent executes in production. The execution layer records the full graph: inputs, decisions, actions, outputs, timing, and context.
2. Replay — When a bug is reported, replay the exact execution locally. See the same decisions, the same actions, the same failure — deterministically.
3. Modify — Fork the execution graph at the point of failure. Modify the graph to fix the issue: add a validation step, change an action, adjust a precondition.
4. Verify — Re-run the modified graph against the original inputs. Confirm the fix produces the correct output.
5. Deploy — Promote the fixed graph to production. All future executions with matching contexts use the corrected path.

# Replay a failed production execution
$ sudoexec replay exec-2026-03-15-a4f2c
> replaying 23 steps from production trace...
> step 18: FAIL - precondition `schema_migrated` not met
✗ reproduced failure at step 18

# Fork and fix
$ sudoexec fork exec-2026-03-15-a4f2c --at-step 17
> forked graph. editing step 17...

# Insert migration step before the failing step
$ sudoexec insert --after 17 --action run_migration
> inserted step 17b: run_migration

# Verify the fix
$ sudoexec replay exec-2026-03-15-a4f2c --with-fork
✓ all 24 steps passed
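
The fork, insert, and verify steps in the session above can also be sketched in plain Python. This is an illustrative model under stated assumptions, not the tool's internals: `Step`, `replay_graph`, and `fork_and_insert` are hypothetical, and preconditions are reduced to simple flags.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Step:
    name: str
    requires: Optional[str] = None  # precondition flag that must already be set
    provides: Optional[str] = None  # flag this step sets when it runs


def replay_graph(steps: List[Step]) -> Tuple[bool, int]:
    """Run steps in order; return (ok, index of the failing step or -1)."""
    state: Set[str] = set()
    for i, step in enumerate(steps):
        if step.requires and step.requires not in state:
            return False, i
        if step.provides:
            state.add(step.provides)
    return True, -1


def fork_and_insert(steps: List[Step], after: int, new_step: Step) -> List[Step]:
    """Copy the graph and insert new_step after 0-based position `after`."""
    forked = copy.deepcopy(steps)
    forked.insert(after + 1, new_step)
    return forked


original = [
    Step("load_config"),
    Step("connect_db"),
    Step("write_rows", requires="schema_migrated"),
]
ok, failed_at = replay_graph(original)  # reproduces the failure at index 2

fixed = fork_and_insert(
    original,
    after=1,
    new_step=Step("run_migration", provides="schema_migrated"),
)
ok_fixed, _ = replay_graph(fixed)  # the fork passes; `original` is untouched
```

Forking by copy keeps the failing trace intact, so it can still be replayed later to confirm that the failure was real.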

Regression Testing for Agents

One of the most powerful features of RDD is deterministic regression testing. When you fix a bug in an execution graph, the original failed execution becomes a test case. The fixed graph must pass against the original inputs and produce the correct output. This is a real, deterministic test, not a statistical bet that the LLM will probably do the right thing.

Over time, your library of execution traces becomes a comprehensive test suite. Every production failure that gets fixed adds a new regression test. Your agent system accumulates knowledge of edge cases, failure modes, and correct behaviors — all encoded as deterministic, replayable graphs.
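
A trace library used as a regression suite might look like the sketch below. The names (`RegressionCase`, `run_graph`, `run_regressions`) and the toy string-normalizing graph are assumptions made for illustration. Each case pairs the inputs captured from a failed execution with the output the fixed graph must produce, and the check is exact equality.

```python
from typing import Any, Callable, Dict, List, NamedTuple

Graph = List[Callable[[Any], Any]]


class RegressionCase(NamedTuple):
    trace_id: str  # hypothetical identifier of the recorded execution
    inputs: Any    # original inputs captured with the failure
    expected: Any  # output the fixed graph must produce


def run_graph(graph: Graph, inputs: Any) -> Any:
    """Apply each step of the graph to the running value, in order."""
    value = inputs
    for step in graph:
        value = step(value)
    return value


def run_regressions(graph: Graph, cases: List[RegressionCase]) -> Dict[str, bool]:
    """Replay every recorded case against the graph; True means it passed."""
    return {c.trace_id: run_graph(graph, c.inputs) == c.expected for c in cases}


# Toy fixed graph: trim whitespace, then lower-case.
fixed_graph: Graph = [str.strip, str.lower]

cases = [
    RegressionCase("trace-001", "  Alice ", "alice"),
    RegressionCase("trace-002", "BOB", "bob"),
]
results = run_regressions(fixed_graph, cases)
```

Every fixed production failure adds one more `RegressionCase`, and the whole list runs against each new version of the graph.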

Collaborative Debugging

Execution graphs are shareable artifacts. When an agent fails in production, the on-call engineer can export the execution trace and share it with the team. Anyone can replay it locally, see exactly what happened, and propose fixes — all without needing access to the production environment or the original LLM context.

This transforms agent debugging from a solo detective exercise into a collaborative engineering activity. The execution graph becomes the shared language for discussing agent behavior, just as code is the shared language for discussing software behavior.
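
As a rough illustration of a trace as a shareable artifact, the sketch below serializes a trace to JSON and loads it back on another machine. The trace layout and the `export_trace` and `import_trace` helpers are hypothetical, not a documented export format.

```python
import json
from typing import Any, Dict


def export_trace(trace: Dict[str, Any]) -> str:
    """Serialize a trace to stable, human-readable JSON for sharing."""
    return json.dumps(trace, sort_keys=True, indent=2)


def import_trace(payload: str) -> Dict[str, Any]:
    """Load a shared trace; no production access is needed to read it."""
    return json.loads(payload)


trace = {
    "id": "example-trace",
    "inputs": {"table": "orders"},
    "steps": [
        {"action": "connect_db", "status": "ok"},
        {
            "action": "write_rows",
            "status": "fail",
            "reason": "precondition `schema_migrated` not met",
        },
    ],
}

shared = export_trace(trace)     # what the on-call engineer sends
received = import_trace(shared)  # what a teammate loads locally
first_failure = next(s for s in received["steps"] if s["status"] == "fail")
```

Because the file carries the inputs and every recorded step, a teammate can locate the first failure without re-running the agent.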

Beyond Debugging

RDD extends beyond debugging into agent development itself. When building a new agent capability, you can prototype the execution graph manually, test it with real data, and only then connect it to the LLM for dynamic path selection. This inverts the traditional approach of letting the LLM figure out the execution path and hoping it gets it right.
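
One way to picture this inversion: write the path by hand first, run it on real data, and put path selection behind an interface that an LLM-backed selector can later implement. Everything in the sketch below (`ACTIONS`, `run_agent`, `manual_selector`) is a hypothetical illustration rather than a prescribed API.

```python
from typing import Callable, Dict, List

Action = Callable[[dict], dict]
PathSelector = Callable[[dict], List[str]]


def validate(data: dict) -> dict:
    return {**data, "valid": bool(data.get("rows"))}


def transform(data: dict) -> dict:
    return {**data, "rows": [r.upper() for r in data["rows"]]}


def publish(data: dict) -> dict:
    return {**data, "published": data["valid"]}


ACTIONS: Dict[str, Action] = {
    "validate": validate,
    "transform": transform,
    "publish": publish,
}


def manual_selector(data: dict) -> List[str]:
    """Hand-written prototype path, tested before any LLM is involved."""
    return ["validate", "transform", "publish"]


def run_agent(select: PathSelector, data: dict) -> dict:
    """Run whatever path the selector returns, rejecting unknown actions."""
    path = select(data)
    unknown = [name for name in path if name not in ACTIONS]
    if unknown:
        raise ValueError(f"unknown actions: {unknown}")
    for name in path:
        data = ACTIONS[name](data)
    return data


result = run_agent(manual_selector, {"rows": ["a", "b"]})
```

An LLM-backed selector would only need to satisfy the same `PathSelector` signature, so the action set and its validation stay fixed while path selection becomes dynamic.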