AI Agent Evals Are the New Unit Tests

June 29, 2026
AI Engineering, Agents, Evaluation

AI agent demos are useful. They show possibility. They help teams build intuition. They make abstract capabilities feel real enough to debate.

But demos are not proof.

That distinction matters more every month. As agents move from chat windows into development workflows, research pipelines, business operations, and customer-facing systems, the serious question is no longer "Can the agent do something impressive once?" The serious question is: can it do the right thing reliably, under realistic constraints, across enough messy cases that a team can trust it?

That is why evals are becoming one of the most important engineering surfaces in AI.

For traditional software, unit tests gave developers a compact way to encode expectations and catch regressions. For AI agents, evals are starting to play a similar role, but the shape is more complex. Agents use tools. They modify state. They make multi-step plans. They recover from failures. They sometimes find loopholes in the task definition. And when they are wrong, they can be wrong in ways that look superficially plausible.

The best agent builders are going to be the ones who can design systems that are not only impressive, but measurable.

The Shift From Output To Outcome

A simple LLM evaluation can ask: did the model return a good answer?

An agent evaluation has to ask more:

  • Did it choose the right tools?
  • Did it preserve the user's intent across multiple turns?
  • Did it modify the right files, records, or external state?
  • Did it reach the correct final outcome?
  • Did it do so safely, cheaply, and within an acceptable number of steps?
  • Did the system recover when a tool failed?
  • Did it create hidden damage while appearing successful?

This is the core difference between output-based evaluation and outcome-based evaluation.

A coding agent that says "I fixed the bug" has not proven anything. The outcome is whether the test suite passes, the regression is covered, the diff is reviewable, and the surrounding behavior still works. A research agent that produces a polished report has not proven reliability either. The outcome is whether the sources are real, the claims are grounded, the synthesis is useful, and the system can repeat that quality without cherry-picking.
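The difference can be shown in a few lines. This is a minimal sketch, not a real grader: `AgentResult`, `output_check`, and `outcome_check` are hypothetical names, and a temporary directory with one file stands in for whatever state a real agent would modify.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class AgentResult:
    """Hypothetical record of one agent run."""
    final_message: str  # what the agent says it did
    workspace: Path     # the state the agent actually left behind


def output_check(result: AgentResult) -> bool:
    # Output-based: trusts the agent's own claim.
    return "fixed" in result.final_message.lower()


def outcome_check(result: AgentResult) -> bool:
    # Outcome-based: inspects the final state and ignores the claim.
    config = result.workspace / "config.txt"
    return config.exists() and config.read_text().strip() == "timeout=30"


# An agent that claims success but left the old value in place.
workspace = Path(tempfile.mkdtemp())
(workspace / "config.txt").write_text("timeout=5\n")
result = AgentResult(final_message="I fixed the bug.", workspace=workspace)

print(output_check(result))   # True
print(outcome_check(result))  # False
```

The two checks disagree on the same run, which is the case an eval suite exists to catch.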

Anthropic's guide, Demystifying evals for AI agents, frames this well: agent systems are hard to evaluate because they operate over many turns, use tools, modify state, and adapt based on intermediate results. That is exactly what makes them valuable, and exactly what makes one-off demonstrations insufficient.

Benchmarks Are Useful, But They Are Not The Finish Line

Recent agent benchmarks make the reliability gap visible.

ORAgentBench evaluates agents on end-to-end operations research tasks. These are not toy prompts. Agents receive operational briefs, multi-file data, configuration artifacts, execution environments, and hidden validators. They have to write and run solution code, satisfy hard constraints, and produce decision artifacts with measurable objective quality.

The headline result is sobering: the best tested agent passed 35.51% of all tasks and 20.59% of hard tasks.

That does not mean agents are bad. It means realistic work is hard. It also means benchmarks that include execution, feasibility, and hidden validation are doing their job. They expose the distance between plausible output and dependable operational performance.

Benchmarking Agentic Review Systems shows a different version of the same pattern. AI review systems can track quality signals and catch many injected errors, but even the strongest configuration caught only 71.6% of them. Useful, yes. Complete, no. The remaining gap matters because the cost of missed errors is not evenly distributed.

InnovatorBench pushes the question into AI research itself. It evaluates agents on long-horizon research tasks such as data construction, filtering, augmentation, loss design, reward design, and scaffold construction. The paper's findings are especially relevant for builders: frontier agents show promise, but still struggle with fragile algorithmic work, long-horizon decision making, resource management, and overreliance on template-like reasoning.

These results are not reasons to dismiss agents. They are reasons to build better evals.

The Benchmark Trap: Cherry-Picked Leaderboards

There is a real risk that agent benchmarks become marketing artifacts.

Any benchmark can be overfit. Any leaderboard can be cherry-picked. A private eval can be tuned until it flatters the system that created it. A public score can hide details about the harness, sampling, retries, task exclusions, tool access, budget, or failure modes. And as soon as a benchmark becomes commercially valuable, teams have incentives to optimize for the scoreboard instead of the underlying capability.
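Retries are a good example of how a reporting choice can move a score. The sketch below is plain arithmetic under a loud assumption: each attempt succeeds independently with the same probability `p`. Real agent attempts are often correlated, and the value `p = 0.5` is illustrative only.

```python
def pass_at_least_once(p: float, k: int) -> float:
    # Chance that at least one of k independent attempts succeeds.
    return 1.0 - (1.0 - p) ** k


def pass_every_time(p: float, k: int) -> float:
    # Chance that all k independent attempts succeed.
    return p ** k


p = 0.5  # assumed per-attempt success rate, for illustration
for k in (1, 3, 5):
    print(k, round(pass_at_least_once(p, k), 3), round(pass_every_time(p, k), 3))
# 1 0.5 0.5
# 3 0.875 0.125
# 5 0.969 0.031
```

Under these assumptions the same agent looks like a 97% system or a 3% system depending on which number is reported, so a score without its retry policy is hard to interpret.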

That is why open, collaborative evaluation matters.

Not "open" as a vague brand word. Open at every layer:

  • Engineers should be able to inspect tasks, harnesses, graders, traces, and failure cases.
  • Researchers should be able to challenge methodology, add harder tasks, and identify benchmark leakage.
  • AI providers should be able to reproduce runs under shared constraints rather than publish isolated claims.
  • Product teams should be able to adapt public benchmarks into private regression suites that reflect their own workflows.
  • The community should be able to improve benchmarks when models learn to game them.

Open-source evaluation frameworks and benchmarks already point in this direction. SWE-bench evaluates coding agents against real GitHub issues. Terminal-Bench tests agents in terminal environments. Inspect AI provides an open framework for model and agent evaluations. These efforts are not perfect, but they are the right kind of imperfect: inspectable, extensible, and useful to people outside the organization that created them.

The future should not be one leaderboard to rule them all. It should be a living evaluation ecosystem.

Principal Engineers Should Think In Harnesses

For engineers, the most important mental model is this:

You are not evaluating a model. You are evaluating a model inside a harness.

The harness includes the prompt, tools, memory, retrieval, permissions, execution environment, retry logic, error handling, state management, budget limits, and grader design. Change the harness and you may change the result. This is why two systems using the same model can behave very differently in production.
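One way to make that concrete is to give every harness configuration an identifier and store it next to each score. The following is a sketch under assumptions: `HarnessConfig` and its fields are hypothetical and cover only part of the list above.

```python
from dataclasses import asdict, dataclass, replace
import hashlib
import json


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical subset of the settings that make up a harness."""
    model: str
    system_prompt: str
    tools: tuple[str, ...]
    max_steps: int
    max_retries: int
    grader_version: str


def fingerprint(config: HarnessConfig) -> str:
    # Stable hash over every field, so any harness change yields a new ID.
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = HarnessConfig(
    model="candidate",
    system_prompt="You are a careful coding agent.",
    tools=("read_file", "run_tests"),
    max_steps=20,
    max_retries=1,
    grader_version="v1",
)
more_retries = replace(base, max_retries=3)  # same model, different harness

print(fingerprint(base) == fingerprint(base))          # True
print(fingerprint(base) == fingerprint(more_retries))  # False
```

Keying results on the fingerprint rather than on the model name alone keeps two runs from being compared as if only the model had changed.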

Efficient Benchmarking of AI Agents makes this scaffold sensitivity explicit. Agent performance depends on the framework wrapping the model, not just the model itself. The paper also argues that reduced benchmark subsets can preserve rankings if selected carefully, especially around mid-difficulty tasks. That is a practical engineering point: exhaustive evaluation is expensive, so teams need eval suites that are both informative and runnable.
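As a rough illustration of the subset idea, the sketch below keeps tasks whose historical pass rate falls in a middle band. This is not the paper's selection procedure. The function name, the 0.3 to 0.7 band, and the pass rates are all assumptions chosen for the example.

```python
def mid_difficulty_subset(
    pass_rates: dict[str, float],
    low: float = 0.3,
    high: float = 0.7,
) -> list[str]:
    # Keep tasks that are neither nearly always solved nor nearly never solved,
    # since those extremes do little to separate one agent from another.
    return sorted(task for task, rate in pass_rates.items() if low <= rate <= high)


# Illustrative pass rates across previously evaluated agents.
historical = {
    "task-a": 0.95,
    "task-b": 0.50,
    "task-c": 0.05,
    "task-d": 0.35,
    "task-e": 0.70,
}

print(mid_difficulty_subset(historical))  # ['task-b', 'task-d', 'task-e']
```

Any such subset should be checked against the full suite from time to time to confirm that it still ranks systems the same way.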

In production, this leads to a layered eval strategy:

  1. Capability evals: hard tasks that reveal what the agent can and cannot do yet.
  2. Regression evals: stable tasks that should pass nearly every time.
  3. Outcome checks: deterministic validators for final state, files, database rows, or external artifacts.
  4. Transcript checks: tool usage, unnecessary steps, unsafe actions, and failure recovery behavior.
  5. Human review: calibration for subjective quality and edge cases.
  6. Cost and latency tracking: because an agent that succeeds too slowly or at too high a cost can still fail as a product.
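The automated layers in that list (outcome checks, transcript checks, and cost and latency limits) can share one verdict. Here is a minimal sketch, assuming a hypothetical `RunTrace` record. The tool names, limits, and state shape are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RunTrace:
    """Hypothetical trace of one agent run on one task."""
    final_state: dict[str, str]
    tool_calls: list[str]
    cost_usd: float
    latency_s: float


@dataclass
class Verdict:
    passed: bool
    failures: list[str] = field(default_factory=list)


def evaluate(
    trace: RunTrace,
    expected_state: dict[str, str],
    forbidden_tools: set[str],
    max_steps: int,
    max_cost_usd: float,
    max_latency_s: float,
) -> Verdict:
    failures: list[str] = []
    # Outcome check: deterministic comparison of the final state.
    if trace.final_state != expected_state:
        failures.append("outcome: final state does not match expected state")
    # Transcript checks: unsafe actions and unnecessary steps.
    unsafe = sorted(set(trace.tool_calls) & forbidden_tools)
    if unsafe:
        failures.append(f"transcript: forbidden tools used: {unsafe}")
    if len(trace.tool_calls) > max_steps:
        failures.append(f"transcript: {len(trace.tool_calls)} steps exceeds {max_steps}")
    # Cost and latency limits.
    if trace.cost_usd > max_cost_usd:
        failures.append("budget: cost limit exceeded")
    if trace.latency_s > max_latency_s:
        failures.append("budget: latency limit exceeded")
    return Verdict(passed=not failures, failures=failures)


# The final state is correct, but the agent took an unsafe action on the way.
trace = RunTrace(
    final_state={"ticket": "closed"},
    tool_calls=["read_ticket", "delete_customer_record", "close_ticket"],
    cost_usd=0.04,
    latency_s=12.0,
)
verdict = evaluate(
    trace,
    expected_state={"ticket": "closed"},
    forbidden_tools={"delete_customer_record"},
    max_steps=10,
    max_cost_usd=0.10,
    max_latency_s=60.0,
)
print(verdict.passed)  # False
for failure in verdict.failures:
    print(failure)
```

An outcome check alone would have passed this run. The transcript check is what surfaces the damage.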

This is where senior engineering judgment matters. A weak eval suite creates false confidence. A strong eval suite creates an honest feedback loop.

Evals As Product Infrastructure

The best agent evals should look less like benchmark theater and more like product infrastructure.

If an AI coding assistant is expected to fix bugs, its evals should include real repositories, failing tests, expected patches, linting, type checks, security checks, and reviewability criteria. If a research agent is expected to write briefs, its evals should include source verification, citation quality, contradiction handling, recency checks, and synthesis quality. If an operations agent is expected to make business decisions, its evals should validate constraints, final state, and downstream impact.

The important move is to encode the product's actual definition of success.

That usually means combining deterministic checks with judgment-based checks. Unit tests and validators catch hard failures. LLM judges can help with nuance, but they need rubrics and calibration. Human reviewers remain essential for setting the bar, especially in early versions of a system. Over time, the strongest human judgments can become rubrics, and the strongest rubrics can become automated checks.
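Calibration can start as a simple comparison between judge verdicts and human verdicts on the same items. The sketch below assumes binary pass/fail labels and makes no LLM call: the `judge` list stands in for verdicts already collected from a judge, and both label lists are illustrative. It reports raw agreement and Cohen's kappa, which corrects agreement for chance.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    observed = agreement_rate(human, judge)
    n = len(human)
    human_pass = sum(human) / n
    judge_pass = sum(judge) / n
    # Agreement expected if both raters labeled at random with these pass rates.
    expected = human_pass * judge_pass + (1 - human_pass) * (1 - judge_pass)
    if expected == 1.0:
        # Both raters gave the same constant label. Kappa is undefined here,
        # so this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels for eight graded transcripts.
human = [True, True, False, False, True, False, True, True]
judge = [True, True, False, True, True, False, False, True]

print(agreement_rate(human, judge))          # 0.75
print(round(cohens_kappa(human, judge), 3))  # 0.467
```

A judge that agrees with reviewers 75% of the time but scores a kappa below 0.5 is a prompt to tighten the rubric before trusting it unattended.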

This mirrors how good engineering teams already work. You do not replace judgment with tests. You use tests to preserve judgment at scale.

A Better Way To Talk About Agent Progress

The AI industry likes clean narratives: model X beats model Y, benchmark score goes up, agents are solved, agents are doomed.

Reality is more interesting.

Agent progress is not a single number. It is a stack of capabilities:

  • Can the model reason through the task?
  • Can the harness expose the right tools?
  • Can the system choose actions under uncertainty?
  • Can it recover from partial failure?
  • Can it verify its own work?
  • Can it produce a useful artifact?
  • Can it do all of that repeatedly under cost, latency, safety, and quality constraints?

A benchmark score is a useful signal. A reproducible eval harness is leverage.

That is why evals deserve more attention from builders. They are not just a research artifact. They are the bridge between model capability and product reliability.

The teams that win with agents will not be the teams with the flashiest demo. They will be the teams with the clearest feedback loops. They will know what their agents are good at, where they fail, how those failures change across model upgrades, and which improvements actually matter to users.

The Practical Takeaway

If you are building with agents, start treating evals like first-class infrastructure.

A Minimal Eval Loop

The most useful first version is usually small enough to run constantly. A team does not need a grand leaderboard on day one. It needs a boring command that can answer: did this change make the agent better, worse, slower, riskier, or more expensive?

The habit to build is simple: run the eval before you trust the demo, then inspect the trace when the result surprises you.

Terminal
$ pnpm agent:eval --suite regression --model candidate
running 48 outcome checks against candidate
pass rate: 45/48
regressions: 2
cost delta: +7.8%
next step: inspect traces for failed tool calls
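The summary lines in that output come down to a comparison between a baseline run and a candidate run. The following sketch shows one way to compute them. It is not the implementation behind `agent:eval`: `TaskResult`, `summarize`, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    passed: bool
    cost_usd: float


def summarize(baseline: list[TaskResult], candidate: list[TaskResult]) -> dict:
    base_by_id = {r.task_id: r for r in baseline}
    passed = sum(r.passed for r in candidate)
    # A regression is a task the baseline passed and the candidate now fails.
    regressions = sorted(
        r.task_id
        for r in candidate
        if not r.passed and r.task_id in base_by_id and base_by_id[r.task_id].passed
    )
    base_cost = sum(r.cost_usd for r in baseline)
    cand_cost = sum(r.cost_usd for r in candidate)
    cost_delta_pct = (cand_cost - base_cost) / base_cost * 100 if base_cost else 0.0
    return {
        "pass_rate": f"{passed}/{len(candidate)}",
        "regressions": regressions,
        "cost_delta": f"{cost_delta_pct:+.1f}%",
    }


baseline = [
    TaskResult("t1", True, 1.0),
    TaskResult("t2", True, 1.0),
    TaskResult("t3", False, 1.0),
    TaskResult("t4", True, 1.0),
]
candidate = [
    TaskResult("t1", True, 1.0),
    TaskResult("t2", False, 1.5),  # passed before, fails now
    TaskResult("t3", False, 1.0),  # failed before as well, so not a regression
    TaskResult("t4", True, 1.0),
]

print(summarize(baseline, candidate))
# {'pass_rate': '2/4', 'regressions': ['t2'], 'cost_delta': '+12.5%'}
```

Counting only tasks that previously passed is why the terminal output above can show three failures but two regressions.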

What The Command Should Save

At minimum, save the task input, final outcome, tool-call transcript, grader output, model and harness versions, token/cost metrics, and the exact environment used for the run. The point is not just to get a score. The point is to make the failure inspectable enough that another engineer can reproduce it.
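A record like that can be one JSON line per task run. The field names below are assumptions that mirror the list above, and `capture_environment` records only the interpreter and platform. A real harness would likely also pin dependency versions or a container image.

```python
from dataclasses import asdict, dataclass
import json
import platform
import sys


@dataclass
class RunRecord:
    """Hypothetical per-task record written by an eval run."""
    task_input: str
    final_outcome: dict
    tool_calls: list
    grader_output: dict
    model_version: str
    harness_version: str
    tokens_in: int
    tokens_out: int
    cost_usd: float
    environment: dict


def capture_environment() -> dict:
    # Minimal environment capture: interpreter version and OS platform string.
    return {"python": sys.version.split()[0], "platform": platform.platform()}


def append_record(path: str, record: RunRecord) -> None:
    # Append-only JSON Lines: one self-contained record per line.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def load_records(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only file keeps earlier runs intact, so a surprising result can be compared against the exact record from the previous run.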

Do not wait until the system is "done." Write evals while the product is still being shaped. Use them to clarify what good means. Use them to compare model and harness changes. Use them to catch regressions. Use them to prevent demos from becoming false confidence.

And when you use public benchmarks, use them seriously but skeptically. Ask what the benchmark measures, what it leaves out, whether the harness is comparable, whether the task distribution reflects your use case, and whether the results are reproducible.

The healthiest future for AI agents is not private leaderboard bragging. It is open, collaborative, outcome-based evaluation across engineers, researchers, product teams, and AI providers.

That is how agent development becomes less about impressive moments and more about dependable systems.

Sources And Further Reading

  • Anthropic: Demystifying evals for AI agents
  • Benchmarking Agentic Review Systems
  • ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?
  • InnovatorBench: Evaluating Agents' Ability to Conduct Innovative AI Research
  • Efficient Benchmarking of AI Agents
  • Inspect AI
  • SWE-bench
  • Terminal-Bench
