Tutorial // Evals // 2026-06-16 // 12 min read

Add an Eval Harness to Your LLM App

Stop shipping on vibes. A TypeScript tutorial for building a lightweight eval harness: datasets, scorers, an LLM-as-judge, and a CI gate.

Varun Raj Manoharan, Founder & Principal Engineer
Evals, LLM, TypeScript, Testing, Tutorial

Key takeaways

  • Build a dataset, a few deterministic scorers, an LLM-as-judge, a runner, and a CI gate in plain TypeScript you can run with tsx, no framework.
  • Turn every bug report into a dataset row before you fix it, so your eval set becomes a memory of every way the feature has failed.
  • Keep the judge off the hot path: run free deterministic scorers on every commit and gate on the judge only on PRs to main or nightly.
  • Set the CI threshold below where your real score sits and retry transient API failures, because a gate that fails randomly becomes decoration people override.

Here is the thing nobody admits in standups: most LLM features ship because someone typed three prompts into a dev console, the answers looked fine, and everyone moved on. That is not a test. It is a vibe. And vibes do not survive a prompt edit, a model bump, or the one customer who phrases their question slightly differently than you did.

I have watched a "tiny tweak" to a system prompt quietly break extraction on 30% of inputs, with nobody noticing for a week because the demo case still worked. The fix was not a smarter model. It was a dozen example inputs, a few scoring functions, and a number that goes down when you break something.

That number is what we are building. By the end of this post you will have a small eval harness in TypeScript: a dataset of inputs paired with expectations, a handful of deterministic scorers, an LLM-as-judge for the fuzzy cases, a runner that aggregates everything into one score, and a CI gate that fails the build when quality drops below a line you draw. No framework. Plain TypeScript you can run with tsx.

What you'll need

  • Node 18 or newer.
  • An Anthropic API key in ANTHROPIC_API_KEY (the judge calls Claude).
  • npm install @anthropic-ai/sdk and npm install -D tsx typescript.
  • A function you actually want to evaluate. I will use a toy extractContact that pulls structured data out of a sentence, but swap in your real target: a classifier, a summarizer, a RAG answer function, whatever.

The target function is the only thing you have to adapt. Everything else here is generic.

Step 1: Define a dataset

An eval dataset is just an array of cases. Each case has an input and some notion of what a good output looks like. Resist the urge to build anything fancier than this on day one.

TypeScript
// eval/dataset.ts
export interface EvalCase {
  id: string;
  input: string;
  expected: string;
}

export const dataset: EvalCase[] = [
  {
    id: "basic-contact",
    input: "Reach me at jane@acme.io or call 555-0142.",
    expected: '{"email":"jane@acme.io","phone":"555-0142"}',
  },
  {
    id: "email-only",
    input: "Drop a line to support@foundrysoft.dev whenever.",
    expected: '{"email":"support@foundrysoft.dev","phone":null}',
  },
  {
    id: "noise",
    input: "Thanks for the demo, talk soon!",
    expected: '{"email":null,"phone":null}',
  },
];

Three cases is not a real dataset, but it is a real start, and a real start beats the zero cases you have now. The cases worth adding first are not the happy path; they are the ones that have already burned you. Every time a bug report comes in, turn it into a row here before you fix it. Your dataset becomes a memory of every way the feature has failed, which is far more useful than a memory of every way it worked.
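As a sketch of what that habit looks like, here is a row written from a bug report, plus a small guard that becomes useful as the file grows. The bug report, the case id, and the duplicateIds helper are all invented for illustration; none of them are part of the harness.

```typescript
// Same shape as EvalCase in eval/dataset.ts, repeated so the sketch stands alone.
interface EvalCase {
  id: string;
  input: string;
  expected: string;
}

// Hypothetical bug report: an address typed in capitals came back as null.
const fromBugReport: EvalCase = {
  id: "bug-uppercase-email",
  input: "Send invoices to BILLING@ACME.IO please.",
  expected: '{"email":"BILLING@ACME.IO","phone":null}',
};

// Return every id that appears more than once, so a new row never
// silently shares an id with an existing one in the report.
function duplicateIds(cases: readonly EvalCase[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) dupes.add(c.id);
    seen.add(c.id);
  }
  return Array.from(dupes);
}
```

Naming the id after the bug makes a red case in the report point straight back at the incident that produced it.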

A note on expected: it does not have to be the exact string you demand back. For some scorers it is a substring you require, for the judge it is more of a reference answer. Keep it as the "ground truth" and let the scorers decide how strict to be.

Step 2: Write deterministic scorers

A scorer takes an output and the case it came from, and returns a number between 0 and 1. Deterministic scorers are the backbone of any harness: they are free, instant, and they never flake. Reach for the judge only when these cannot express what you mean.

TypeScript
// eval/scorers.ts
import type { EvalCase } from "./dataset.js";

export interface Scorer {
  name: string;
  score(output: string, testCase: EvalCase): Promise<number>;
}

export const exactMatch: Scorer = {
  name: "exact_match",
  async score(output, testCase) {
    return output.trim() === testCase.expected.trim() ? 1 : 0;
  },
};

export const contains: Scorer = {
  name: "contains",
  async score(output, testCase) {
    return output.includes(testCase.expected) ? 1 : 0;
  },
};

export const jsonValid: Scorer = {
  name: "json_valid",
  async score(output) {
    try {
      JSON.parse(output);
      return 1;
    } catch {
      return 0;
    }
  },
};

Three scorers, three different jobs. exactMatch is the strictest: output must equal expected once leading and trailing whitespace is trimmed. It is right for classification labels and other closed-vocabulary outputs, and wrong for anything with natural-language variation, where it will punish a correct answer for a single extra space inside the string.

contains checks that the expected text shows up somewhere. Looser, and useful when you care that a key fact made it into a longer answer.

jsonValid ignores expected entirely and only asks: is this parseable JSON? It is a structural check, not a correctness one. I lean on this kind of scorer more than I expected to. A lot of LLM bugs are not "wrong answer"; they are "answer wrapped in a markdown fence" or "trailing prose after the JSON." jsonValid catches that class instantly and cheaply, and it does so without an opinion about the content.
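To make that concrete, here is a small sketch of the two failure shapes named above. The sample outputs are invented, and isValidJson is a synchronous stand-in for the jsonValid scorer, written only for this illustration.

```typescript
// Synchronous stand-in for the jsonValid scorer, for illustration only.
function isValidJson(output: string): boolean {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}

// Build the markdown fence at runtime so this listing stays readable.
const fence = "`".repeat(3);
const payload = '{"email":"jane@acme.io","phone":"555-0142"}';

// Invented sample outputs: one clean, two structurally broken.
const samples: Record<string, string> = {
  clean: payload,
  fenced: `${fence}json\n${payload}\n${fence}`,
  trailingProse: `${payload} Hope that helps!`,
};

const verdicts: Record<string, boolean> = {};
for (const [name, output] of Object.entries(samples)) {
  verdicts[name] = isValidJson(output);
}
console.log(verdicts);
```

Only the clean sample parses. One limit to keep in mind: JSON.parse also accepts a bare string or number, so this check says nothing about whether the output is an object with the fields you expect.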

Notice every scorer is async even when it does not need to be. That is deliberate: it lets the judge (which genuinely is async) share the exact same interface, so the runner does not have to special-case anything.

Step 3: Add an LLM-as-judge scorer

Deterministic scorers cannot tell you whether a summary is faithful, whether a tone is right, or whether two differently-worded answers mean the same thing. For that you need judgment, and the pragmatic way to get judgment at scale is to ask another model. You give it the output, a reference, and a rubric, and it returns a score.

I will be honest about this up front: the judge is the least trustworthy part of the harness. It is a model grading a model. It has biases, it can be inconsistent, and it will occasionally rate slop as brilliant. We will deal with that in the gotchas. But used carefully (narrow rubric, constrained output, low stakes per call), it covers the cases nothing else can.

TypeScript
// eval/judge.ts
import Anthropic from "@anthropic-ai/sdk";
import type { EvalCase } from "./dataset.js";
import type { Scorer } from "./scorers.js";

const client = new Anthropic();

const RUBRIC = `You are grading whether an extraction is correct.
Compare the CANDIDATE to the REFERENCE.

Score 1.0 if the candidate captures the same email and phone as the
reference (null where the reference is null), even if formatting differs.
Score 0.5 if it gets one field right and one wrong.
Score 0.0 if it misses both or invents data that is not in the input.

Respond with ONLY a JSON object: {"score": <number>, "reason": "<short>"}`;

export const llmJudge: Scorer = {
  name: "llm_judge",
  async score(output: string, testCase: EvalCase): Promise<number> {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 512,
      system: RUBRIC,
      messages: [
        {
          role: "user",
          content: [
            `INPUT: ${testCase.input}`,
            `REFERENCE: ${testCase.expected}`,
            `CANDIDATE: ${output}`,
          ].join("\n"),
        },
      ],
    });

    const text = response.content.find((b) => b.type === "text");
    if (!text || text.type !== "text") return 0;

    try {
      const parsed = JSON.parse(text.text) as { score?: unknown };
      // A missing or non-numeric score would turn the case mean into NaN,
      // so treat it the same way as a parse failure.
      if (typeof parsed.score !== "number") return 0;
      // Clamp: the model occasionally returns 1.2 or -0.1.
      return Math.max(0, Math.min(1, parsed.score));
    } catch {
      // A judge that can't be parsed is a judge that failed. Score 0
      // rather than silently passing, because a parse failure is a signal.
      return 0;
    }
  },
};

A few decisions worth calling out.

The rubric goes in the system prompt and is written as concretely as I can make it. "Score 1.0 if good" is useless; the judge will invent its own bar. The version above tells it exactly what 1.0, 0.5, and 0.0 mean for this task, which is the single biggest lever on judge reliability. A vague rubric produces noise; a specific one produces something you can almost trust.

I use claude-sonnet-4-6 for the judge rather than the largest available model. The judge runs once per case per eval run, so cost adds up, and Sonnet is plenty sharp for "do these two extractions match." Save the heavyweight models for the thing you are actually shipping.

The output is constrained to a small JSON object and capped at max_tokens: 512. The judge does not need to write an essay; it needs a number and a one-line reason. I parse it, and I treat a parse failure as a score of 0, not a skip. If the judge cannot follow a format this simple, I do not want to quietly count that case as fine.

And I clamp the score into [0, 1]. Models drift outside the range you asked for more often than you would hope.

Step 4: Build the runner

The runner is the loop that ties it together: for every case, run the target function, apply every scorer, and aggregate. This is the piece you will actually run from the command line.

TypeScript
// eval/runner.ts
import { dataset, type EvalCase } from "./dataset.js";
import { exactMatch, contains, jsonValid, type Scorer } from "./scorers.js";
import { llmJudge } from "./judge.js";
import { extractContact } from "../src/extract.js"; // your target function

const scorers: Scorer[] = [jsonValid, contains, llmJudge];

interface CaseResult {
  id: string;
  output: string;
  scores: Record<string, number>;
  mean: number;
}

async function runCase(testCase: EvalCase): Promise<CaseResult> {
  const output = await extractContact(testCase.input);

  const scores: Record<string, number> = {};
  for (const scorer of scorers) {
    scores[scorer.name] = await scorer.score(output, testCase);
  }

  const values = Object.values(scores);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return { id: testCase.id, output, scores, mean };
}

export interface EvalReport {
  results: CaseResult[];
  overall: number;
}

export async function runEvals(): Promise<EvalReport> {
  const results: CaseResult[] = [];
  for (const testCase of dataset) {
    results.push(await runCase(testCase));
  }

  const overall =
    results.reduce((sum, r) => sum + r.mean, 0) / results.length;

  return { results, overall };
}

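I pick which scorers apply in the scorers array at the top. For the contact extractor I dropped exactMatch: the model returns valid JSON that means the right thing but rarely matches my reference byte for byte, so exactMatch would punish correct answers. Be aware that contains, as written, is sensitive in the same way here, because it looks for the whole reference string inside the output; if your references are full JSON objects, point it at a shorter key fact or compare parsed fields instead. That is a real decision you make per target: which scorers express what "correct" means here, and which would just add noise.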
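If string comparison is too strict for a JSON target, one deterministic alternative is to parse both sides and compare field by field. This is a minimal sketch, assuming the email/phone object shape from the dataset; fieldMatch and parseObject are hypothetical names, not part of the harness above.

```typescript
// Same shape as EvalCase in eval/dataset.ts, repeated so the sketch stands alone.
interface EvalCase {
  id: string;
  input: string;
  expected: string;
}

// Assumption: these are the only fields the extractor is graded on.
const FIELDS = ["email", "phone"] as const;

// Parse a JSON object; return null for invalid JSON or non-object values.
function parseObject(text: string): Record<string, unknown> | null {
  try {
    const value: unknown = JSON.parse(text);
    if (typeof value !== "object" || value === null || Array.isArray(value)) {
      return null;
    }
    return value as Record<string, unknown>;
  } catch {
    return null;
  }
}

// Hypothetical scorer: the fraction of fields whose value equals the
// reference. A missing field is treated as null, like the dataset does.
const fieldMatch = {
  name: "field_match",
  async score(output: string, testCase: EvalCase): Promise<number> {
    const got = parseObject(output);
    const want = parseObject(testCase.expected);
    if (!got || !want) return 0;
    let hits = 0;
    for (const field of FIELDS) {
      if ((got[field] ?? null) === (want[field] ?? null)) hits++;
    }
    return hits / FIELDS.length;
  },
};
```

Because it has the same name and score shape as the other scorers, it could sit in the scorers array next to jsonValid. Key order and whitespace no longer matter, but value formatting still does: "555-0142" and "5550142" count as different.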

The aggregation is deliberately boring. Each case gets the mean of its scorers, and the run gets the mean of the cases. You can weight scorers later if you want (maybe jsonValid failing should tank the whole case), but start with a flat mean and only add weighting when you have a reason. Premature scoring schemes are their own kind of overfitting.
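For when that reason does show up, here is one possible shape for a weighted aggregate with a gating scorer. The weights and the choice of json_valid as the gate are placeholder assumptions, not recommendations, and aggregate is a hypothetical replacement for the flat mean in runCase.

```typescript
// Hypothetical weights; scorers not listed here default to a weight of 1.
const WEIGHTS: Record<string, number> = { llm_judge: 2 };

// Hypothetical gates: a 0 from any of these zeroes the whole case.
const GATES = new Set<string>(["json_valid"]);

// Weighted mean of a case's scores, short-circuited by gating scorers.
function aggregate(scores: Record<string, number>): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const [name, value] of Object.entries(scores)) {
    if (GATES.has(name) && value === 0) return 0;
    const weight = WEIGHTS[name] ?? 1;
    weighted += value * weight;
    totalWeight += weight;
  }
  // No scorers at all is treated as a failed case rather than NaN.
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}

console.log(aggregate({ json_valid: 1, contains: 0, llm_judge: 1 }));
```

The trade-off is that the overall number stops being comparable with earlier runs the moment the weights change, so it is worth noting the scheme alongside the score.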

The loop is sequential. For three cases that is fine; for two hundred you will want to run them in batches with Promise.all so the judge calls overlap. I left it serial here because it is easier to read and easier to debug when a single case misbehaves, and "easy to debug" matters more than throughput when you are first wiring this up.
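When the dataset does outgrow the serial loop, the batching can look roughly like this. mapInBatches is a hypothetical helper, and the batch size you pass is something to tune against your rate limits.

```typescript
// Run `worker` over `items` in fixed-size batches. Results keep the
// order of the input array.
async function mapInBatches<T, R>(
  items: readonly T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Calls inside a batch overlap; the next batch starts once all finish.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

In runEvals the loop would then collapse to something like const results = await mapInBatches(dataset, 5, runCase), with 5 as an arbitrary starting point. Note that Promise.all rejects as soon as one case throws, so a single failing call still aborts the whole run.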

Now a small script to print the report:

TypeScript
// eval/report.ts
import { runEvals } from "./runner.js";

const report = await runEvals();

for (const r of report.results) {
  const flags = Object.entries(r.scores)
    .map(([name, v]) => `${name}=${v.toFixed(2)}`)
    .join("  ");
  console.log(`${r.mean.toFixed(2)}  ${r.id.padEnd(16)} ${flags}`);
}

console.log(`\nOverall: ${report.overall.toFixed(3)}`);

Run it:

Shell
npx tsx eval/report.ts

You get a per-case breakdown and one overall number. That number is the whole point. Change a prompt, rerun, watch it move. The first time you see it drop after an "improvement" you were sure about, the harness has already paid for itself.

Wiring it into CI

A score you have to remember to look at is a score nobody looks at. The harness earns its keep when it blocks a regression automatically. So we add a threshold and an exit code.

TypeScript
// eval/ci.ts
import { runEvals } from "./runner.js";

const THRESHOLD = Number(process.env.EVAL_THRESHOLD ?? "0.8");

const report = await runEvals();

console.log(`Overall score: ${report.overall.toFixed(3)}`);
console.log(`Threshold:     ${THRESHOLD.toFixed(3)}`);

// Surface the worst offenders so a failed run is actionable.
const failures = report.results
  .filter((r) => r.mean < THRESHOLD)
  .sort((a, b) => a.mean - b.mean);

for (const f of failures) {
  console.log(`  ✗ ${f.id} (${f.mean.toFixed(2)}): ${f.output.slice(0, 80)}`);
}

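// Written as a negated >= so that a NaN score (empty dataset) or a NaN
// threshold (bad EVAL_THRESHOLD) fails the gate instead of slipping through.
if (!(report.overall >= THRESHOLD)) {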
  console.error(`\nEval gate failed: ${report.overall.toFixed(3)} < ${THRESHOLD}`);
  process.exit(1);
}

console.log("\nEval gate passed.");

Add a script to package.json:

JSON
{
  "scripts": {
    "eval": "tsx eval/ci.ts"
  }
}

And run it as a step in your pipeline. A minimal GitHub Actions job:

YAML
# .github/workflows/eval.yml
name: Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

process.exit(1) is what makes this a gate rather than a report. When the overall score falls below EVAL_THRESHOLD, the build goes red and, provided the check is marked as required in your branch protection rules, the PR cannot merge. Printing the failing cases first means whoever broke it does not have to rerun anything locally to see what went wrong: the failing ids and their outputs are right there in the log.

One caution: this gate makes a network call to Claude on every run, so a flaky network or a rate limit can fail the build for reasons that have nothing to do with your code. I would not put the judge-based gate on every commit. Run the deterministic scorers everywhere, and run the full judge-backed gate on PRs to main or on a nightly schedule. More on that next.

Gotchas

Everything above works. Here is what bites you once it is in production.

Non-determinism. The same input does not always produce the same output, so the same input does not always produce the same score. A case that scores 0.9 today might score 0.8 tomorrow with no code change. This is why a hard threshold of 0.85 is dangerous when your real score hovers around 0.86: you will get random red builds. Either leave a margin between your real score and the gate, or run flaky cases a few times and average. Do not pretend the number is more stable than it is.
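A sketch of the "run it a few times and average" option. scoreRepeatedly is a hypothetical helper; scoreOnce stands for whatever produces one score for one case (target call plus scorers).

```typescript
interface ScoreSummary {
  mean: number;
  min: number;
  max: number;
}

// Call `scoreOnce` `runs` times and summarize the results. The gap
// between min and max shows how noisy the case is.
async function scoreRepeatedly(
  runs: number,
  scoreOnce: () => Promise<number>,
): Promise<ScoreSummary> {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new RangeError("runs must be a positive integer");
  }
  let sum = 0;
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < runs; i++) {
    const score = await scoreOnce();
    sum += score;
    min = Math.min(min, score);
    max = Math.max(max, score);
  }
  return { mean: sum / runs, min, max };
}
```

Every extra run multiplies the number of target and judge calls, so it makes sense to reserve this for the cases that actually wobble rather than the whole dataset.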

Judge bias. LLM judges have tells. They tend to reward longer, more confident, more verbose answers even when a terse one is correct. They favor outputs that look like their own writing. They are lenient when the rubric is vague and erratic when it is long. The mitigations are unglamorous: keep rubrics short and concrete, constrain the output to a number, and, when it matters, spot-check the judge against human labels on a sample. If the judge and a human disagree more than occasionally, the judge score is decorative. Treat it as a signal, not a verdict.
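The spot-check does not need tooling. A rough sketch, assuming you have hand-labeled a sample with scores on the same 0 to 1 scale as the judge; LabeledSample, agreementRate, and the 0.25 tolerance are all illustrative choices.

```typescript
interface LabeledSample {
  id: string;
  judge: number; // score the LLM judge gave
  human: number; // score a person gave for the same output
}

// Fraction of samples where judge and human differ by at most `tolerance`.
function agreementRate(
  samples: readonly LabeledSample[],
  tolerance = 0.25,
): number {
  if (samples.length === 0) {
    throw new Error("need at least one labeled sample");
  }
  const agreed = samples.filter(
    (s) => Math.abs(s.judge - s.human) <= tolerance,
  ).length;
  return agreed / samples.length;
}

// Ids worth reading by hand: the cases where the two disagree.
function disagreements(
  samples: readonly LabeledSample[],
  tolerance = 0.25,
): string[] {
  return samples
    .filter((s) => Math.abs(s.judge - s.human) > tolerance)
    .map((s) => s.id);
}
```

The list of disagreeing ids is usually more useful than the rate itself, because reading those cases tells you whether the rubric or the judge is at fault.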

Dataset drift. Your dataset is a snapshot of the inputs you thought of when you wrote it. Real traffic moves. New phrasings, new edge cases, new failure modes appear that your three (or three hundred) cases never covered, and your eval score stays a cheerful 0.92 while users hit walls. The only fix is feeding the dataset from reality: sample real inputs periodically, label them, add them. An eval suite that never grows is slowly becoming a lie.

Cost. Every judge call is an API call. A 200-case dataset with one judge scorer is 200 calls per run, and if CI runs the full suite on every push across a busy repo, that adds up in both money and time. Keep the judge off the hot path. Deterministic scorers are free, so run those on every commit. Gate on the judge less often: PRs to main, or nightly. And keep the judge on a mid-tier model; you are not paying for the smartest grader, you are paying for a consistent one.
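One way to keep the judge off the hot path is to choose the scorer list by mode. This is a sketch under assumptions: the "fast" and "full" modes, and the idea of reading the mode from an environment variable such as EVAL_MODE, are mine and not something the harness above defines.

```typescript
type EvalMode = "fast" | "full";

// Only the name matters for selection, so this works with any scorer shape.
interface NamedScorer {
  name: string;
}

// Anything other than an explicit "full" falls back to the cheap mode,
// so a typo never triggers paid judge calls by accident.
function parseMode(raw: string | undefined): EvalMode {
  return raw === "full" ? "full" : "fast";
}

// Cheap scorers always run; judge-backed scorers only in "full" mode.
function selectScorers<S extends NamedScorer>(
  mode: EvalMode,
  cheap: readonly S[],
  judged: readonly S[],
): S[] {
  return mode === "full" ? [...cheap, ...judged] : [...cheap];
}
```

In runner.ts that could replace the hard-coded array with something like selectScorers(parseMode(process.env.EVAL_MODE), [jsonValid, contains], [llmJudge]). Scores from the two modes are not comparable, so each mode needs its own threshold.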

Flaky thresholds. This is the one that erodes trust fastest. If the gate fails for random reasons (network blips, judge variance, a rate limit), people stop believing it, and a few "just re-run it" merges later the gate is decoration. A gate you override is worse than no gate, because it gives you false confidence. Make the threshold honest: set it below where you actually sit, retry transient API failures rather than scoring them as zero, and make sure that when it goes red, something is genuinely worse.
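A minimal sketch of the retry idea, kept generic on purpose. withRetry and its isTransient callback are hypothetical; which errors count as transient (timeouts, rate limits, 5xx responses) is an assumption you have to encode for your own client.

```typescript
// Retry `fn` with exponential backoff, but only for errors the caller
// classifies as transient. Anything else is rethrown immediately.
async function withRetry<T>(
  fn: () => Promise<T>,
  isTransient: (error: unknown) => boolean,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  if (!Number.isInteger(attempts) || attempts < 1) {
    throw new RangeError("attempts must be a positive integer");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === attempts || !isTransient(error)) break;
      // Backoff doubles each time: baseDelayMs, 2x, 4x, ...
      const delay = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Wrapped around the judge call, a blip gets another chance while a genuine failure still surfaces as an error instead of a quiet zero. The Anthropic SDK client also has its own maxRetries setting, so check what it already retries before stacking another layer on top.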

Wrapping up

What you have is small on purpose. A dataset, a few scorers, a judge for the fuzzy bits, a runner, and an exit code. None of it is clever. That is the feature: you can read the whole thing, you can debug it at 2am, and you can extend it without learning a framework.

The payoff is not the code; it is the habit. Once there is a number that moves when you change a prompt, you stop arguing about whether an edit helped and start measuring it. You turn every bug report into a test case instead of a memory. And the next time someone wants to bump the model, you run npm run eval and find out in two minutes whether it is actually better, instead of finding out from a customer in two weeks.

Start with three cases. Add the next one the day something breaks. The harness grows itself from there.