
Shipping Generative AI Features Without Breaking Production
A field guide to taking LLM-powered features from a flashy demo to a system you can trust on a Friday afternoon.
Key takeaways
- Version your prompts and review prompt changes in pull requests, because a one-word edit can break a feature like a bad commit.
- Build an eval harness of 30 to 50 real inputs before you build the feature, and run it on every prompt or model change.
- Assume the model will be wrong: use structured outputs, add a cheap validation check, and fall back to a human or safe default when confidence is low.
- Track token usage from day one, cache aggressively, and route the easy 80 percent of requests to a smaller model.
Overview
Demos are easy. A prompt, a model, a nice UI, and a room full of nodding heads. The hard part starts after the applause, when latency, cost, the occasional hallucination, and the general moodiness of language models all show up at once and meet real users.
We build LLM systems for a living. After enough 2am incidents you stop trusting the demo and start trusting a checklist. Here's ours.
Treat the prompt as code, not a sticky note
The most common mistake we see: the prompt lives in a Slack thread, or it's hardcoded three layers deep in some controller. That prompt is some of the most important logic you have. Treat it that way.
Version it, so every change is a deploy you can tag and roll back. Keep the static instructions separate from the dynamic context, so you can reason about each part on its own. And put prompt edits through pull-request review. A one-word change can break a feature just as badly as a bad commit, and it's a lot harder to trace later if nobody reviewed it.
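Here is one way that can look in practice. This is a minimal sketch, not a prescribed layout: the `PromptVersion` class, the `summarize_ticket` prompt, and the version string are all made up for illustration. The idea is that the static instructions and the per-request context live in separate fields, and a short content hash gives you something to log next to every model call.

```python
import hashlib
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical container for a prompt that lives in the repo and ships via PR."""
    name: str
    version: str
    instructions: str       # static part: changes only through a reviewed commit
    context_template: str   # dynamic part: filled in per request

    def fingerprint(self) -> str:
        # Short content hash to log with each call, so any output can be
        # traced back to the exact prompt text that produced it.
        payload = f"{self.instructions}\n---\n{self.context_template}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    def render(self, **context: str) -> str:
        # Template.substitute raises KeyError if a placeholder is left unfilled,
        # which is what we want: fail loudly instead of sending a broken prompt.
        dynamic = Template(self.context_template).substitute(**context)
        return f"{self.instructions}\n\n{dynamic}"


# Example prompt; the wording and version number are illustrative assumptions.
SUMMARIZE_TICKET = PromptVersion(
    name="summarize_ticket",
    version="2.1.0",
    instructions="Summarize the support ticket in two sentences. Do not invent details.",
    context_template="Ticket:\n$ticket_text",
)

prompt_text = SUMMARIZE_TICKET.render(ticket_text="App crashes when I upload a photo.")
```

Because the fingerprint is derived from the text itself, a one-word edit produces a different hash even if someone forgets to bump the version field.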
Build the eval harness before you build the feature
"It looked good when I tried it" is not a measurement. Before you write the user-facing part, write the thing that tells you whether it actually works.
Grab 30–50 real inputs, and not just the happy path. You want the messy ones your users really send. Then decide what "correct" even means for each one. Sometimes that's an exact match, sometimes it's closeness in meaning, sometimes you hand it to a bigger model with a rubric and let it grade. Run that set on every prompt or model change and watch the trend line instead of the demo.
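A harness like this does not need to be fancy. The sketch below is one possible shape, with every name in it (`EvalCase`, `run_evals`, the two graders, the `fake_model` stand-in) invented for illustration. Each case carries its own definition of "correct" as a grader function, so an exact match and a looser check can sit in the same suite. A similarity-based or model-graded check would plug into the same slot; neither is implemented here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input_text: str
    expected: str
    grader: Callable[[str, str], bool]  # (model_output, expected) -> pass/fail


def exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()


def contains_expected(output: str, expected: str) -> bool:
    return expected.lower() in output.lower()


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through `generate` and collect the failures for inspection."""
    failures = []
    for case in cases:
        output = generate(case.input_text)
        if not case.grader(output, case.expected):
            failures.append(
                {"input": case.input_text, "expected": case.expected, "got": output}
            )
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"total": len(cases), "passed": passed, "pass_rate": pass_rate, "failures": failures}


def fake_model(text: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "REFUND" if "money back" in text.lower() else "I am not sure."


# Two toy cases; a real suite would hold the 30-50 messy inputs described above.
CASES = [
    EvalCase("i want my money back!!", "refund", exact_match),
    EvalCase("pls cancel, also where is my order??", "cancel", contains_expected),
]

report = run_evals(fake_model, CASES)
print(f"{report['passed']}/{report['total']} passed")  # prints "1/2 passed"
```

Storing `pass_rate` alongside the prompt version on every run is what gives you a trend line to watch, and the `failures` list tells you which inputs regressed.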
The teams that win at this aren't the ones with the cleverest prompts. They're the ones with the shortest gap between making a change and knowing whether it helped.
Assume the model will be wrong
It's probabilistic. Plan for the bad tail, not the average case.
Pin the output down wherever you can; structured outputs and tool schemas turn "I hope it returns valid JSON" into something you can rely on. Put a cheap check behind it too, either a smaller model or plain old validation, so the obvious garbage gets caught before a user ever sees it. And when confidence is low, hand off to a human or fall back to a safe default. A confident wrong answer does more damage than no answer.
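The "plain old validation" half of that can be very small. The following is a sketch under stated assumptions: the model has been asked to return JSON with a `category` and a self-reported `confidence` between 0 and 1, and the category list, the 0.7 threshold, and the shape of the safe default are all hypothetical. A self-reported score is only one possible confidence signal, so treat the threshold as something to tune against your own evals.

```python
import json
from typing import Optional

# All three values below are illustrative assumptions, not recommendations.
ALLOWED_CATEGORIES = {"billing", "bug", "feature_request", "other"}
CONFIDENCE_FLOOR = 0.7
SAFE_DEFAULT = {"category": "other", "needs_human": True}


def parse_classification(raw: str) -> Optional[dict]:
    """Return a validated dict, or None if the output is unusable in any way."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    category = data.get("category")
    confidence = data.get("confidence")

    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {"category": category, "confidence": float(confidence)}


def classify_or_fallback(raw: str) -> dict:
    """Use the model's answer only when it validates and clears the threshold."""
    parsed = parse_classification(raw)
    if parsed is None or parsed["confidence"] < CONFIDENCE_FLOOR:
        return dict(SAFE_DEFAULT)  # copy, so callers cannot mutate the default
    return {"category": parsed["category"], "needs_human": False}


good = classify_or_fallback('{"category": "billing", "confidence": 0.93}')
unsure = classify_or_fallback('{"category": "bug", "confidence": 0.41}')
garbage = classify_or_fallback("Sure! Here is the JSON you asked for:")
```

Here `good` keeps the model's category, while `unsure` and `garbage` both land on the safe default and get flagged for a human.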
Watch the bill
A feature that costs forty cents a call is a science experiment, not a product. Track token usage from day one. Cache hard. Send the easy 80% of requests to a small model and save the expensive one for the work that genuinely needs it.
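All three habits fit in one thin wrapper around the model call. The sketch below is deliberately a toy: the prices are placeholders rather than any vendor's real rates, the 50-word routing rule is a stand-in for whatever "easy" means in your product, and `CachedRouter`, `Usage`, and `pick_model` are hypothetical names. It assumes you supply a `call_model` function that returns the answer together with the tokens it used.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable

# Placeholder prices in dollars per 1,000 tokens; substitute your real rates.
PRICE_PER_1K = {"small": 0.0005, "large": 0.01}


@dataclass
class Usage:
    tokens: dict = field(default_factory=lambda: {"small": 0, "large": 0})
    cache_hits: int = 0

    def record(self, model: str, token_count: int) -> None:
        self.tokens[model] += token_count

    def cost(self) -> float:
        return sum(PRICE_PER_1K[m] * n / 1000 for m, n in self.tokens.items())


def pick_model(request_text: str) -> str:
    """Toy routing rule (an assumption): short requests go to the small model."""
    return "small" if len(request_text.split()) <= 50 else "large"


class CachedRouter:
    def __init__(self, call_model: Callable[[str, str], tuple]):
        # call_model(model_name, text) must return (answer, tokens_used).
        self.call_model = call_model
        self.cache: dict = {}
        self.usage = Usage()

    def ask(self, request_text: str) -> str:
        # Normalize before hashing so trivial variations share a cache entry.
        normalized = request_text.strip().lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.usage.cache_hits += 1
            return self.cache[key]

        model = pick_model(request_text)
        answer, tokens_used = self.call_model(model, request_text)
        self.usage.record(model, tokens_used)
        self.cache[key] = answer
        return answer


def fake_call(model: str, text: str) -> tuple:
    """Stand-in for a real API call: pretends every request uses 100 tokens."""
    return f"{model}: ok", 100


router = CachedRouter(fake_call)
first = router.ask("Reset my password")
second = router.ask("  reset my password ")   # served from cache, no tokens spent
long_answer = router.ask("word " * 60)        # over 50 words, routed to "large"
```

An in-memory dict with no expiry is only safe for answers that never go stale; a real cache needs an eviction policy and a key that includes the prompt version.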
The short version
Most of this is boring engineering pointed at a weird, unpredictable component. Version the prompts. Measure constantly. Assume it'll be wrong and build for that. Keep an eye on the cost. The clever part is real, but it falls over in production unless the unglamorous stuff underneath is solid.