Insights // Evals | 2026-03-13 | 9 min read

Measuring AI Quality: Evals Your Board Will Trust

How to measure AI feature quality in numbers leadership can act on, covering what to measure, how to avoid vanity metrics, and how to report quality over time.

Varun Raj Manoharan, Founder & Principal Engineer
Tags: Evals, Quality, Metrics, CTO

Key takeaways

  • Lead with task success rate: the share of representative inputs whose output does the job, measured on the same set every release.
  • Build a representative eval set of 50 to 200 inputs pulled from real logs, freeze it across releases, and version it like code.
  • Validate the LLM judge against human labels on a sample: agreement near 60 percent means your automated score is fiction and you have been reporting noise.
  • Tie every reported metric to a decision: if a number getting worse changes nothing you do, it is a vanity metric taking up space.

Overview

If a board member asks "is the AI feature getting better or worse?" and your honest answer is "it feels solid," you have a problem. Not because the feature is bad, but because you have no number to point at, and a quality story without a number gets discounted to zero the moment someone pushes on it.

The fix is an eval: a fixed set of representative inputs, a scoring method, and a single headline metric you track across releases. Task success rate is the one I lead with. Not token counts, not latency, not how many prompts you've shipped. Whether the feature does the job the user asked for, measured on the same inputs every release, reported as a trend.

This post is about the strategy of that measurement and how to present it, not the code. I'll be honest about where it breaks down too, because the methods most teams reach for (LLM-as-judge especially) have failure modes that will quietly inflate your number if you let them.

Why "it feels good" doesn't survive a board meeting

"It feels good" has three problems in a room full of people who weren't there when you tested it.

It isn't comparable. You can't put "feels good" next to last quarter's "feels good" and say which is higher. So you can't show progress, which is the entire point of reporting.

It isn't defensible. The first skeptical question ("good for whom, on what?") has no answer, and now the conversation is about your credibility instead of the product.

It hides regressions. Models drift, prompts get edited, a dependency updates, and the thing that felt good in January quietly degrades by March. A vibe check catches none of that. A fixed eval run on every release catches it the day it happens.

The reason teams fall back on vibes is that real measurement feels like a research project. It isn't. You need 50 to 200 representative inputs and a way to score them. That's a week of work, not a quarter, and it changes every quality conversation you have after.

What to actually measure

Start with task success rate: of N representative inputs, how many produced an output that does the job. "Does the job" needs a definition you write down before you score anything, because a fuzzy bar lets you grade on a curve when the number looks bad.
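To make the definition concrete, here is a minimal sketch of the calculation. Everything named here is illustrative rather than from any library: `EvalCase`, `toy_feature`, and the `must_contain` check are stand-ins for your real feature and your written-down bar for "does the job."

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    must_contain: str  # hypothetical success bar: the output has to mention this


def task_success_rate(
    cases: Sequence[EvalCase],
    run_feature: Callable[[str], str],
    does_the_job: Callable[[EvalCase, str], bool],
) -> float:
    """Share of cases whose output meets the success bar."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        1 for case in cases if does_the_job(case, run_feature(case.input_text))
    )
    return passed / len(cases)


def toy_feature(text: str) -> str:
    """Stand-in for the real AI feature: a crude keyword router."""
    if "refund" in text.lower():
        return "Routing to billing"
    return "Routing to general support"


cases = [
    EvalCase("c1", "I want a refund", "billing"),
    EvalCase("c2", "Refund my last order", "billing"),
    EvalCase("c3", "Cancel my plan", "cancellation"),
]

rate = task_success_rate(
    cases,
    run_feature=toy_feature,
    does_the_job=lambda case, output: case.must_contain in output,
)
print(f"task success rate: {rate:.2f}")
```

The point of passing `does_the_job` in as a function is that the bar is fixed in one place before anything gets scored, so it can't drift when the number looks bad.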

Everything else is supporting cast. Latency and cost matter for the business, but they're not quality, and conflating them is how you end up reporting that the feature got "better" when it just got cheaper. Token counts measure nothing a user cares about. Number of prompts shipped measures activity, not outcomes.

Here are the metrics worth tracking, what each one tells you, and the trap that comes with it.

| Metric | What it tells you | The trap |
| --- | --- | --- |
| Task success rate | Whether the feature does the job, on a fixed input set | Useless if the input set isn't representative of real usage |
| Pass rate on hard cases | How you do on the inputs that actually break | Easy to ignore because it's the number that looks worst |
| Regression count vs. last release | Cases that worked before and broke now | Invisible unless you keep the input set frozen across releases |
| Human agreement with LLM-judge | Whether your automated scorer can be trusted | People assume it's high; it often isn't, and nobody checks |
| Production thumbs-down / escalation rate | Real-world failure your offline set missed | Lags reality and undercounts (most unhappy users say nothing) |
| Latency, cost per request | Business viability of the feature | Not quality; reporting it as quality hides real regressions |

The pattern across the traps is the same. A metric is only as honest as the data behind it and the discipline around it. Which is why the eval set itself matters more than the scoring.

Building a representative eval set

The eval set is the whole ballgame. A 95% success rate on inputs that look nothing like real traffic is a number that will embarrass you the first time a customer hits the case you never tested.

Pull your inputs from reality. Real user queries from logs, real support tickets, real documents your feature processes. Then make sure the set covers the shape of actual usage, not just the happy path:

  • The common cases, weighted roughly how often they show up.
  • The hard cases: ambiguous inputs, edge formats, the long tail where the feature is most likely to fail.
  • The cases you've already been burned by. Every production incident becomes a permanent eval case so the same bug can never silently return.
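One way the three ingredients above could be assembled, as a rough sketch. It assumes your logs can be reduced to `(category, text)` pairs, which may not match your data, and `build_eval_set` is a name I made up for illustration.

```python
import random
from collections import Counter


def build_eval_set(logged, hard_cases, incident_cases, n_common=50, seed=0):
    """Sample common cases in proportion to traffic, then add hard cases and incidents.

    logged: list of (category, text) pairs pulled from real logs (assumed shape).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    counts = Counter(category for category, _ in logged)
    total = sum(counts.values())

    chosen = []
    for category, count in counts.items():
        pool = [item for item in logged if item[0] == category]
        # Proportional share of the budget, but at least one case per category.
        k = min(len(pool), max(1, round(n_common * count / total)))
        chosen.extend(rng.sample(pool, k))

    # Hard cases and past incidents go in unconditionally.
    return chosen + list(hard_cases) + list(incident_cases)


# Hypothetical traffic: 70% billing questions, 30% shipping questions.
logged = [("billing", f"billing query {i}") for i in range(70)]
logged += [("shipping", f"shipping query {i}") for i in range(30)]

eval_set = build_eval_set(
    logged,
    hard_cases=[("billing", "refund for an order split across two cards")],
    incident_cases=[("shipping", "address with no postcode")],
    n_common=10,
)
print(len(eval_set), Counter(category for category, _ in eval_set))
```

Because of rounding and the one-per-category floor, the common-case count can land a little above or below `n_common`. For a board-facing number that is fine; what matters is that the mix resembles real traffic.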

Freeze it. Once the set exists, you don't quietly swap inputs between releases, because the moment you do, your trend line compares two different tests and means nothing. Add cases over time (especially regressions and new failure modes), but treat the core set as a stable benchmark. Version it like you version code.
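A cheap way to enforce the freeze is to pin a fingerprint of the set and fail loudly when it changes without a version bump. This is a sketch under assumptions: cases are stored as dictionaries with an `id` field, and `eval_set_fingerprint` is a hypothetical helper, not part of any tool.

```python
import hashlib
import json


def eval_set_fingerprint(cases):
    """SHA-256 over the cases, independent of list order."""
    canonical = json.dumps(
        sorted(cases, key=lambda case: case["id"]),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_frozen(cases, expected_fingerprint):
    """Raise if the eval set no longer matches the pinned fingerprint."""
    if eval_set_fingerprint(cases) != expected_fingerprint:
        raise RuntimeError(
            "eval set changed: bump its version and start a new trend line segment"
        )


cases_v1 = [
    {"id": "c1", "input": "I want a refund"},
    {"id": "c2", "input": "cancel my plan"},
]
pinned = eval_set_fingerprint(cases_v1)  # commit this value next to the eval set

# Reordering is harmless; editing an input is not.
reordered = list(reversed(cases_v1))
edited = [{"id": "c1", "input": "I want a refund please"}, cases_v1[1]]
print(eval_set_fingerprint(reordered) == pinned)  # True
print(eval_set_fingerprint(edited) == pinned)     # False
```

Running `check_frozen` at the start of every eval run turns "we don't quietly swap inputs" from a team norm into something the pipeline enforces.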

Fifty cases is enough to start and far better than zero. Two hundred gives you stable numbers you can slice by category. You do not need ten thousand, and chasing that number is a common way to never ship the eval at all.
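To put a rough number on "stable," here is a sketch that computes a 95% Wilson score interval for the same observed success rate at 50 and at 200 cases. It treats each case as an independent pass/fail trial, which is a simplification, and the 86% figure is just an example value.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Same observed rate (86%) on two eval set sizes.
low_50, high_50 = wilson_interval(43, 50)
low_200, high_200 = wilson_interval(172, 200)

print(f"n=50:  {low_50:.3f} to {high_50:.3f}")
print(f"n=200: {low_200:.3f} to {high_200:.3f}")
```

Under these assumptions the uncertainty is roughly ten points either side at 50 cases and roughly five at 200, so the larger set gives an interval about half as wide. That is the practical difference between "enough to start" and "stable enough to slice by category."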

Offline evals versus production signals

These measure two different things, and you need both.

Offline evals run against your frozen set before you ship. They're controlled and repeatable, so they answer "did this change make the feature better or worse?" That's your release gate and the source of your trend line.

Production signals come from real users after you ship: thumbs-down rates, escalations to a human, retries, abandonment, support tickets tagged to the feature. They catch the cases your eval set didn't anticipate, which there will always be, because real users are more creative than your test data.

The honest limit of each: offline evals only test what you thought to include, so they miss the unknown unknowns. Production signals catch those, but they lag (you find out after users are already unhappy) and they undercount badly, because most frustrated users just leave without clicking thumbs-down. Run both and feed production failures back into the offline set. That loop is the engine that makes your eval set more representative over time, and it's the core of disciplined MLOps and AgentOps work: measure offline, watch production, close the loop.
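The feedback half of that loop can be very small. A minimal sketch, assuming production failures arrive as dictionaries with a `ticket` and an `input` field (both hypothetical names), might look like this:

```python
def fold_in_production_failures(eval_cases, production_failures):
    """Return a new eval set with unseen production failures appended.

    Inputs are compared after lower-casing and collapsing whitespace, which is
    a deliberately crude duplicate check.
    """

    def normalise(text):
        return " ".join(text.lower().split())

    seen = {normalise(case["input"]) for case in eval_cases}
    merged = list(eval_cases)
    for failure in production_failures:
        key = normalise(failure["input"])
        if key in seen:
            continue  # already covered by an existing case
        seen.add(key)
        merged.append(
            {
                "id": f"prod-{failure['ticket']}",
                "input": failure["input"],
                "source": "production",
            }
        )
    return merged


eval_cases = [{"id": "c1", "input": "Cancel my plan"}]
production_failures = [
    {"ticket": "4417", "input": "cancel  my PLAN"},  # duplicate of c1
    {"ticket": "4420", "input": "refund for a gift order"},  # new failure mode
]

merged = fold_in_production_failures(eval_cases, production_failures)
print([case["id"] for case in merged])
```

Adding cases this way changes the set, so it should go in as a new version of the eval set rather than a silent edit to the frozen core.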

Avoiding vanity metrics

A vanity metric is one that reliably goes up and tells leadership nothing about whether the feature works. They're tempting precisely because they're easy to move.

The tells: it only ever improves, it has no failure threshold, and nobody can name a decision it would change. "We shipped 40 prompt iterations this quarter" is activity. "We process 2 million tokens a day" is volume. "Average response length grew 15%" is noise. None of them answer the only question that matters, which is whether the output is more useful than it was last release.

The discipline is to tie every reported metric to a decision. Before a number goes on a slide, answer: if this gets worse, what do we do? If the answer is "nothing," it's a vanity metric and it's taking up space where a real one should be. Task success rate passes this test: if it drops, you don't ship, or you roll back. That's a metric with teeth.
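"A metric with teeth" can be written down as a gate that returns a ship or no-ship decision. The thresholds below (`max_drop`, `max_regressions`) are placeholder values I picked for illustration; yours should come from what the business can tolerate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    ship: bool
    reasons: tuple


def release_gate(
    success_rate,
    previous_success_rate,
    regressions,
    max_drop=0.02,       # hypothetical tolerance for a drop in success rate
    max_regressions=0,   # hypothetical tolerance for newly broken cases
):
    """Decide whether a release may ship, and say why not if it may not."""
    reasons = []
    drop = previous_success_rate - success_rate
    if drop > max_drop:
        reasons.append(f"success rate fell by {drop:.2f} (limit {max_drop:.2f})")
    if regressions > max_regressions:
        reasons.append(f"{regressions} regressions (limit {max_regressions})")
    return GateResult(ship=not reasons, reasons=tuple(reasons))


blocked = release_gate(success_rate=0.84, previous_success_rate=0.89, regressions=3)
allowed = release_gate(success_rate=0.89, previous_success_rate=0.86, regressions=0)
print(blocked)
print(allowed)
```

If you can't fill in a function like this for a metric, that is a good sign the metric fails the decision test.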

Tracking quality as a trend

A single number is a snapshot. A snapshot can't tell anyone whether you're improving, and improving is the story leadership actually wants. So run the same eval on every release and plot it.

The trend is where the value lives. It shows whether your team's changes move the needle, it catches regressions as a visible dip the week they happen, and it turns "trust us, it's getting better" into a line going up that anyone can read. When the line dips, you have a conversation about a specific release instead of a vague worry about quality drift.

A minimal record per release is enough to start:

release   success_rate   hard_case_rate   regressions   judge_agreement
v1.4      0.86           0.61             0             0.91
v1.5      0.89           0.67             0             0.90
v1.6      0.84           0.55             3             0.88

That v1.6 row is the whole point. Success dipped, hard cases dropped more, three things that used to work broke, and you can see it before a customer tells you. Without the trend, v1.6 ships and you find out from a support queue three weeks later.
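The regressions column is the one that needs per-case results rather than an aggregate. A sketch of how it could be derived, assuming each release stores a mapping of case ID to pass/fail (the function names and the release labels are hypothetical):

```python
def find_regressions(previous, current):
    """Case IDs that passed in the previous release and fail in the current one.

    Both arguments map case_id -> bool. Cases may be added over time, but a
    case that disappears means the frozen core was altered.
    """
    missing = previous.keys() - current.keys()
    if missing:
        raise ValueError(f"cases dropped from the frozen set: {sorted(missing)}")
    return sorted(
        case_id for case_id, passed in previous.items() if passed and not current[case_id]
    )


def release_record(release, previous, current):
    """One row of the per-release table."""
    return {
        "release": release,
        "success_rate": round(sum(current.values()) / len(current), 2),
        "regressions": len(find_regressions(previous, current)),
    }


# Hypothetical per-case results for two consecutive releases.
results_old = {"c1": True, "c2": True, "c3": False, "c4": True}
results_new = {"c1": True, "c2": False, "c3": True, "c4": False, "c5": True}

print(find_regressions(results_old, results_new))
print(release_record("v2.1", results_old, results_new))
```

Note that the new release fixed `c3` while breaking two others. An aggregate success rate alone would blur that; the regression list names the exact cases to look at.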

How to present it to the board

Boards don't want your eval harness. They want a scorecard they can read in thirty seconds and a story they can repeat to someone else.

Keep it to one screen. The headline metric and its direction since last report. The trend line over the last several releases. One line on what moved it. A flag on any open regression and what you're doing about it. That's the whole report. Resist the urge to show every metric you track, because a wall of numbers reads as noise and the one number that matters gets lost.

Frame it in their language. Not "task success rate is 0.89," but "the feature now handles 89 of 100 representative customer requests correctly, up from 86 last quarter, and we caught and fixed three regressions before any customer hit them." Same number, but now it's a story about reliability and discipline, which is what a board is actually evaluating when they ask about AI quality.

The honest limits of LLM-as-judge

Most teams score evals with a stronger model grading the output against a rubric, because it scales and humans don't. It's a genuinely useful tool. It is also not as trustworthy as the clean decimal it produces makes it look.

Where it fails: judges are biased toward longer and more confident-sounding answers even when they're wrong. They're inconsistent, so the same output can score differently across runs. When the judge comes from the same model family as your generator, it tends to share that model's blind spots, so it will miss the exact errors your generator is prone to. And they can be gamed, because optimizing your prompt to please the judge is not the same as optimizing it to help the user.
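The inconsistency problem is the easiest of these to measure: score the same outputs several times and count how often the verdict holds. The sketch below uses a `ToyJudge` class as a stand-in that deliberately flips on borderline outputs; a real judge would be a model call, and `consistency_rate` is a name invented for this example.

```python
def consistency_rate(outputs, judge, runs=5):
    """Share of outputs on which every one of `runs` judge calls agrees."""
    if not outputs:
        raise ValueError("no outputs to score")
    stable = 0
    for output in outputs:
        verdicts = {judge(output) for _ in range(runs)}
        if len(verdicts) == 1:
            stable += 1
    return stable / len(outputs)


class ToyJudge:
    """Stand-in for an LLM judge: alternates its verdict on hedged outputs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, output):
        self.calls += 1
        if "maybe" in output.lower():
            return self.calls % 2 == 0  # flips from call to call
        return len(output) > 0


outputs = ["Refund issued.", "Maybe try again later?", "Plan cancelled."]
stability = consistency_rate(outputs, ToyJudge())
print(f"judge gave a stable verdict on {stability:.0%} of outputs")
```

A judge that is consistent can still be consistently wrong, so this complements the human comparison rather than replacing it.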

So you measure the judge before you trust the judge. Take a sample, have a human score it, and compare. If human and judge agree 90% of the time, the judge is a decent proxy and that agreement number belongs on your scorecard. If they agree 60% of the time, your automated score is fiction and you've been reporting noise.
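The comparison itself is a few lines. This sketch assumes both humans and the judge produce a pass/fail verdict per case ID; `judge_agreement` and the labels below are made up for illustration.

```python
def judge_agreement(human_labels, judge_labels):
    """Agreement rate on cases both scored, plus the IDs where they disagree.

    Both arguments map case_id -> bool (pass/fail).
    """
    shared = human_labels.keys() & judge_labels.keys()
    if not shared:
        raise ValueError("no cases were scored by both human and judge")
    disagreements = sorted(
        case_id for case_id in shared if human_labels[case_id] != judge_labels[case_id]
    )
    agreed = len(shared) - len(disagreements)
    return agreed / len(shared), disagreements


# Hypothetical sample of ten cases; the judge is too generous on one of them.
human_labels = {f"c{i}": i not in (3, 7) for i in range(10)}
judge_labels = {f"c{i}": i != 3 for i in range(10)}

agreement, disagreements = judge_agreement(human_labels, judge_labels)
print(f"agreement: {agreement:.2f}, review these: {disagreements}")
```

Returning the disagreeing case IDs matters as much as the rate. Reading those cases is how you find out whether the judge is too lenient, too strict, or confused by a particular input type.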

This is why human spot-checks don't go away. You don't need a human on every case (that defeats the point of automation), but a regular sample on every release does two jobs: it keeps the judge honest, and it catches the failure modes both your model and your judge are blind to. Skip it and your beautiful trend line might be tracking nothing real.

Where to start

You don't need the full system to get out of the vibes trap. Pick the one feature leadership asks about most. Pull 50 representative inputs from your logs. Write down what "success" means for that feature, then score the current version, by hand if you have to. That single number, with a date on it, is already more than most teams can produce.

Then automate the scoring, validate it against humans, add your past incidents as permanent cases, and run it every release. Within a quarter you have a trend line instead of a feeling, and the next time the board asks whether the AI is getting better, you point at the line and move on to the next agenda item.