Early · building in the open
A note on testing agents that don't give the same answer twice
We're a small team building testpath. This isn't a product page — it's a note on what we're trying to build, and where we honestly are.
The hypothesis
A team running its own LLM agent in production — a support agent, say — struggles to tell a real regression from run-to-run noise. The agent is stochastic, and so is the model grading it, so the same eval scores differently each time. When a case moves from 0.91 to 0.78 between versions, there's no principled way to know if something broke or the dice just landed differently.
We think this is a statistics problem, not a tooling one — and that it's tractable. We might be wrong, and we'd like to find out.
What we're trying to build
Regression testing for stochastic agents, in three parts:
- 01 Measure. Report every eval as a pass rate with an error bar, not a lone number that hides the wobble.
- 02 Decide. Collapse that interval into a CI verdict — green, red, or an honest orange when the data can't yet tell.
- 03 Economize. Sample sequentially and stop the moment the interval clears the line — certainty costs only what it must.
None of the method is ours or new — it's the statistics in Anthropic's Adding Error Bars to Evals and the wider literature. What's missing is a tool that packages it for the regression-on-every-change loop teams actually run.
We are planning an open-source (coming-soon) tool, with a hosted CI service.
Where we are
Early, and honest about it. The PyPI package is a name reservation — the CLI prints its help and little else. We'll be building the runner in the open. Nothing here to buy, and won't be for a while.
What we're looking for
Feedback, not customers. We'd like to talk to a few people running a customer-facing LLM agent in production — fifteen minutes, no pitch — to learn whether the problem we see is the one you have. If the premise is wrong, that's the most useful conversation of all.