Early · building in the open

A note on testing agents that don't give the same answer twice

We're a small team building testpath. This isn't a product page — it's a note on what we're trying to build, and where we honestly are.

The hypothesis

A team running its own LLM agent in production (a support agent, say) struggles to tell a real regression from run-to-run noise. The agent is stochastic, and so is the model grading it, so the same eval scores differently each time. When a case moves from 0.91 to 0.78 between versions, that pair of numbers alone gives no principled way to know whether something broke or the dice just landed differently.
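To make the 0.91 to 0.78 example concrete, here is a rough sketch of the arithmetic. It assumes the two scores are pass rates over n independent runs each and uses a plain normal (Wald) approximation. The run counts of 50 and 200 and the function name diff_interval are illustrative assumptions, not anything testpath ships.

```python
import math

def diff_interval(p_old, n_old, p_new, n_new, z=1.96):
    """Approximate 95% interval for the drop (p_old - p_new).

    Assumption: each score is a pass rate over independent runs,
    and the normal (Wald) approximation is good enough for a sketch.
    """
    se = math.sqrt(p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new)
    diff = p_old - p_new
    return diff - z * se, diff + z * se

# Hypothetical run counts: the same 0.13 drop, measured with 50 vs 200 runs per version.
for n in (50, 200):
    lo, hi = diff_interval(0.91, n, 0.78, n)
    reading = "could be noise" if lo <= 0 <= hi else "likely a real drop"
    print(f"n={n}: drop is in [{lo:.3f}, {hi:.3f}] -> {reading}")
```

Under these assumptions the interval at 50 runs per version still includes zero, while at 200 runs it does not. The same observed drop can be noise or a regression depending on how much data sits behind it.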

We think this is a statistics problem, not a tooling one — and that it's tractable. We might be wrong, and we'd like to find out.

What we're trying to build

Regression testing for stochastic agents, in three parts:

01 Measure. Report every eval as a pass rate with an error bar, not a lone number that hides the wobble.
02 Decide. Collapse that interval into a CI verdict — green, red, or an honest orange when the data can't yet tell.
03 Economize. Sample sequentially and stop the moment the interval clears the threshold, so certainty costs only what it must.
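The three parts above can be sketched in a few lines of Python. This is a minimal illustration, not the testpath runner: the function names, the Wilson score interval, the threshold, and the min_runs and max_runs defaults are all assumptions made for the example. One caveat: re-checking a fixed 95% interval after every run inflates the error rate, so a real implementation would need a method that stays valid under repeated looks.

```python
import math
import random

def wilson_interval(passes, n, z=1.96):
    """Measure: Wilson score interval for a pass rate of passes/n."""
    if n == 0:
        return 0.0, 1.0
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def verdict(lo, hi, threshold):
    """Decide: collapse the interval into a CI-style verdict."""
    if lo >= threshold:
        return "green"   # whole interval is at or above the threshold
    if hi < threshold:
        return "red"     # whole interval is below the threshold
    return "orange"      # interval straddles the threshold: can't tell yet

def run_sequentially(run_case, threshold, min_runs=10, max_runs=200):
    """Economize: keep sampling until the verdict is no longer orange.

    run_case is a hypothetical callable returning True on a pass.
    Naive stopping rule, for illustration only (see the caveat above).
    """
    passes = 0
    for n in range(1, max_runs + 1):
        passes += bool(run_case())
        if n < min_runs:
            continue
        lo, hi = wilson_interval(passes, n)
        v = verdict(lo, hi, threshold)
        if v != "orange":
            return v, n, (lo, hi)
    return "orange", max_runs, wilson_interval(passes, max_runs)

# Usage with a stand-in agent that passes about 95% of the time.
rng = random.Random(0)
print(run_sequentially(lambda: rng.random() < 0.95, threshold=0.80))
```

The Wilson interval is used here because it behaves sensibly at small sample sizes and at pass rates near 0 or 1, where the simpler normal approximation breaks down. Any interval with good small-sample behaviour would slot into the same loop.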

None of the method is ours or new — it's the statistics in Anthropic's Adding Error Bars to Evals and the wider literature. What's missing is a tool that packages it for the regression-on-every-change loop teams actually run.

We're planning an open-source tool (coming soon), with a hosted CI service.

Where we are

Early, and honest about it. The PyPI package is a name reservation: the CLI prints its help and little else. We'll be building the runner in the open. There's nothing here to buy, and there won't be for a while.

What we're looking for

Feedback, not customers. We'd like to talk to a few people running a customer-facing LLM agent in production — fifteen minutes, no pitch — to learn whether the problem we see is the one you have. If the premise is wrong, that's the most useful conversation of all.

Say hello — a short call →