Insights // Production | 2026-04-03 | 9 min read

The AI Readiness Checklist Before You Ship to Customers

A pre-launch checklist for AI features: evals, guardrails, cost controls, monitoring, fallbacks, and the failure modes that embarrass you in production.

Varun Raj Manoharan, Founder & Principal Engineer

Production, AI Strategy, Launch, CTO

Key takeaways

A feature is ready when it fails safely and visibly, not when the model is always right; silent expensive failures mean it is not ready.
Build an eval set of 30 to 100 real inputs with a written pass threshold that runs automatically on every prompt, model, and dependency change.
Guardrails are code that runs after the model and can reject malformed output, not a prompt that says "be safe"; irreversible actions need a confirmation step.
Log token cost per request tagged by feature and customer, set budget alerts, and cap retries and agent steps so one user cannot run up the bill.

What "ready" actually means

Ready does not mean the model is always right. It never will be. Ready means that when the model is wrong, your system catches it, contains it, or degrades into something harmless, and you find out before your customer does. If a feature can fail safely and you can see it failing, it is ready. If it can fail silently and expensively, it is not, no matter how good the demo looked.

This is the checklist we run every feature through before any real customer touches it. It is not about chasing perfection. It is about making the failure modes boring instead of scary. Run through each group below. If you cannot check a box, you have a decision to make: fix it, scope around it, or ship with eyes open and a written reason why.

Quality you can actually measure

You cannot improve what you only judge by vibes. "It looked good when I tried it" is a feeling, not a quality bar. Before launch you need a fixed set of inputs, a definition of correct for each one, and a number you refuse to ship below.

An eval set of 30 to 100 real inputs exists, including the ugly edge cases your users will actually send
Each input has a defined notion of correct (exact match, semantic closeness, or a graded rubric)
A pass threshold is written down and agreed on, not invented after the run
The eval runs automatically on every prompt change, model change, and dependency bump
Someone other than the author has reviewed a sample of outputs by hand

The point of the eval set is not to prove the feature is perfect. It is to give you a trend line, so the next change is a measured step instead of a guess. If you want help standing this up across a portfolio of features, this is core to how we approach enterprise AI development.
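
As a rough illustration of the checklist above, here is a minimal eval gate in Python. Everything in it is an assumption made for the example: the names (EvalCase, run_eval, ready_to_ship), the 0.90 threshold, the exact-match grader, and the canned fake_model that stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Simplest grader: equality ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    cases: Sequence[EvalCase],
    model_fn: Callable[[str], str],
    grader: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of cases the model passes."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if grader(model_fn(case.prompt), case.expected))
    return passed / len(cases)


# Hypothetical threshold: written down before the run, not after it.
PASS_THRESHOLD = 0.90


def ready_to_ship(pass_rate: float, threshold: float = PASS_THRESHOLD) -> bool:
    return pass_rate >= threshold


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    canned = {"2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I am not sure")


CASES = [
    EvalCase("2+2?", "4"),
    EvalCase("Capital of France?", " paris "),
    EvalCase("Largest planet?", "Jupiter"),
]

pass_rate = run_eval(CASES, fake_model)
print(f"pass rate: {pass_rate:.2f}, ship: {ready_to_ship(pass_rate)}")
```

Wired into CI, a script like this would exit non-zero when ready_to_ship returns False, so a prompt or dependency change that drops the pass rate blocks the merge. Semantic or rubric grading would replace exact_match with a different grader function.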

Guardrails and output constraints

The model will eventually produce something you did not plan for. Your job is to make sure that something cannot reach a customer unchecked, leak into a system that trusts it, or get executed as if it were safe.

Outputs are constrained to a schema (structured output or tool calls) wherever the shape matters
A validation layer rejects malformed or out-of-range responses before they are used
Inputs are checked for prompt injection on any flow that touches tools, data, or other users
The model cannot trigger irreversible actions (payments, deletes, emails) without a confirmation step
There is a hard cap on output length and on the number of tool calls per request
Sensitive topics or out-of-scope requests have a defined refusal behavior

A guardrail is not a vibe-check prompt that says "be safe." It is code that runs after the model and can say no.
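
To make "code that can say no" concrete, here is a minimal sketch of a validation layer using only the Python standard library. The action names, the refund cap, and the Decision and RejectedOutput types are hypothetical; a real feature would substitute its own schema and limits.

```python
import json
from dataclasses import dataclass

# Hypothetical policy for this sketch.
ALLOWED_ACTIONS = {"reply", "refund", "escalate"}
IRREVERSIBLE_ACTIONS = {"refund"}
MAX_REFUND_CENTS = 5000


class RejectedOutput(Exception):
    """Raised when model output must not be used."""


@dataclass(frozen=True)
class Decision:
    action: str
    amount_cents: int
    needs_confirmation: bool


def validate_output(raw: str) -> Decision:
    """Parse and check model output; raise instead of passing bad data on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RejectedOutput("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise RejectedOutput("output is not a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise RejectedOutput(f"action not allowed: {action!r}")

    amount = data.get("amount_cents", 0)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise RejectedOutput("amount_cents must be an integer")
    if not 0 <= amount <= MAX_REFUND_CENTS:
        raise RejectedOutput(f"amount_cents out of range: {amount}")

    # Irreversible actions are never executed directly; they are flagged
    # so the caller can require an explicit confirmation step.
    return Decision(action, amount, action in IRREVERSIBLE_ACTIONS)
```

The caller decides what a RejectedOutput means for the feature: retry, fall back, or hand off to a human. The point of the design is that nothing downstream ever sees output that skipped this function.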

A human fallback for low confidence

Most teams build the happy path and forget the off-ramp. The off-ramp is what keeps a confident wrong answer from becoming a customer-facing mistake, and a confident wrong answer does more damage than no answer.

Low-confidence cases route to a human or to a safe default instead of guessing
There is a clear signal for "the model is unsure" (low score, failed validation, ambiguous input)
The fallback path is tested, not theoretical, and someone owns the queue it feeds
Customers can reach a human when the AI cannot help, and that path is obvious

If your feature has no fallback, it is not autonomous. It is just unsupervised.
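
One way to picture the off-ramp is a small routing function. This is a sketch under assumptions: the 0.75 confidence floor is invented, and it presumes your pipeline already produces some confidence score and a validation result for each response.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto"    # safe to send the model's answer
    HUMAN = "human"  # hand off to the human review queue


# Hypothetical floor; tune it against your own eval data.
CONFIDENCE_FLOOR = 0.75


@dataclass(frozen=True)
class ModelResult:
    answer: str
    confidence: float        # assumed to be in [0, 1]
    passed_validation: bool  # result of the guardrail layer


def choose_route(result: ModelResult, floor: float = CONFIDENCE_FLOOR) -> Route:
    """Send anything unvalidated or unsure to a human instead of guessing."""
    if not result.passed_validation:
        return Route.HUMAN
    if result.confidence < floor:
        return Route.HUMAN
    return Route.AUTO
```

The share of requests that land on Route.HUMAN is the fallback rate, which is worth tracking alongside latency and errors. The function is deliberately boring: the hard part is owning the queue behind Route.HUMAN, not the branch itself.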

Cost instrumentation and budget alerts

LLM features have a variable cost that scales with usage and, worse, with abuse. A retry loop or a single malicious user can turn a comfortable bill into a panicked Slack message. You want to know your unit economics before launch, not after the invoice.

Token usage and cost are logged per request, tagged by feature and customer
You know the cost per successful outcome, not just the cost per call
A budget alert fires when daily or hourly spend crosses a threshold
Per-user and per-tenant rate limits are in place so one account cannot run up the bill
Runaway loops (retries, agent steps) have a hard ceiling

If you cannot answer "what does one customer interaction cost us," you are not ready to price it or scale it.
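
A minimal sketch of the arithmetic behind those boxes follows. The per-token prices and the budget are made-up placeholders, not any provider's real rates, and the Usage record is a hypothetical shape for what you log per request.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Placeholder prices in USD per 1M tokens (assumptions, not real rates).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


@dataclass(frozen=True)
class Usage:
    feature: str
    customer: str
    input_tokens: int
    output_tokens: int
    succeeded: bool  # did the request produce a usable outcome?


def request_cost(u: Usage) -> float:
    """Cost of one request in USD."""
    return (
        u.input_tokens * INPUT_PRICE_PER_M + u.output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def cost_per_success(usages: Iterable[Usage]) -> float:
    """Total spend divided by successful outcomes; failed calls still cost money."""
    usages = list(usages)
    total = sum(request_cost(u) for u in usages)
    successes = sum(1 for u in usages if u.succeeded)
    return total / successes if successes else float("inf")


def spend_by_customer(usages: Iterable[Usage]) -> Dict[str, float]:
    """Aggregate spend per customer tag."""
    totals: Dict[str, float] = defaultdict(float)
    for u in usages:
        totals[u.customer] += request_cost(u)
    return dict(totals)


def over_budget(usages: Iterable[Usage], budget_usd: float) -> bool:
    """True when spend in the window crosses the alert threshold."""
    return sum(request_cost(u) for u in usages) > budget_usd
```

Cost per success is the number that diverges from cost per call when retries and failed validations pile up, which is why it is computed over all requests and divided only by the successful ones.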

Monitoring, tracing, and prompt versioning

When something goes wrong at 2am, you need to reconstruct what happened: the input, the prompt version, the model response, the tool calls, the final output. If that trail does not exist, every incident becomes an archaeology project.

Every request is traced end to end (input, prompt version, model, tools, output)
Latency, error rate, and fallback rate are on a dashboard someone watches
Prompts are versioned in source control, and the running version is recorded with each request
You can roll a prompt back as fast as you can roll back code
There is an alert for quality regressions, not just for crashes (a working endpoint can still be giving bad answers)
Personally identifiable data in logs is masked or excluded

The hardest production failures are the quiet ones, where nothing throws an error and the answers just slowly get worse. This is the part most teams skip, and it is the heart of what we mean by MLOps and AgentOps.
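
As an illustration of what one trace record might carry, here is a small standard-library sketch. The field names are assumptions. Deriving the prompt version from a content hash is one option among several (a git commit SHA works as well), and the email regex is a deliberately crude stand-in for real PII masking.

```python
import hashlib
import json
import re
import time
import uuid
from typing import Any, Dict, List

# Crude email pattern for illustration; real PII masking needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def prompt_version(template: str) -> str:
    """Short content hash, so any edit to the prompt yields a new version id."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def mask_pii(text: str) -> str:
    return EMAIL_RE.sub("[email]", text)


def build_trace(
    template: str,
    model: str,
    user_input: str,
    tool_calls: List[str],
    output: str,
) -> Dict[str, Any]:
    """One end-to-end record per request: input, prompt version, model, tools, output."""
    return {
        "trace_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "prompt_version": prompt_version(template),
        "model": model,
        "input": mask_pii(user_input),
        # Only tool names are recorded here; tool arguments can carry PII too.
        "tool_calls": list(tool_calls),
        "output": mask_pii(output),
    }


def to_log_line(trace: Dict[str, Any]) -> str:
    return json.dumps(trace, sort_keys=True)
```

Because the version id is stored with every request, "which prompt produced this answer" becomes a lookup instead of a guess, and rolling back means pointing at an earlier version that is already in source control.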

Data privacy and the legal review

Shipping AI to US customers means you are now making promises about where their data goes and what is done with it. Get this wrong and the cost is not a bug ticket but a breach notice.

You know which provider sees customer data and whether it is used for training (it should not be)
A data processing agreement is in place with each model provider
Sensitive fields are redacted or tokenized before they leave your perimeter
Retention of prompts and outputs is defined, and deletion actually works
Your privacy policy and terms reflect that AI is in the loop
If you serve regulated data (health, financial, EU residents), the relevant review is signed off

This is the one group where "we will fix it later" is genuinely dangerous. Do it before launch.

A plan for when it is wrong

It will be wrong. The question is whether being wrong is a shrug or a crisis. Decide that now, in calm conditions, instead of improvising it during an incident.

Customers can flag a bad response in one click, and those flags land somewhere you read
There is a kill switch to disable the feature without a deploy
You can roll back to the previous prompt or model in minutes
An on-call owner is named, and the runbook says what to do for the common failures
Bad outputs feed back into the eval set so the same mistake is caught next time
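
A kill switch can be as small as a flag that is read at request time rather than baked in at deploy time. The sketch below assumes a hypothetical load_flags callable that fetches flags from wherever you keep them (a config service, a database row, a file); the feature name ai_answers is also invented for the example.

```python
from typing import Callable, Mapping


class KillSwitch:
    """Checks a feature flag on every request, so flipping it needs no deploy."""

    def __init__(self, load_flags: Callable[[], Mapping[str, bool]]) -> None:
        # load_flags is a hypothetical hook into your flag store.
        self._load_flags = load_flags

    def enabled(self, feature: str) -> bool:
        try:
            flags = self._load_flags()
        except Exception:
            # Fail closed: if the flag store is unreachable, the AI path is off.
            return False
        return flags.get(feature) is True


def answer(
    question: str,
    switch: KillSwitch,
    model_fn: Callable[[str], str],
    fallback_text: str,
) -> str:
    """Serve the model's answer only while the feature is switched on."""
    if not switch.enabled("ai_answers"):
        return fallback_text
    return model_fn(question)
```

Failing closed is a design choice, not a rule. For a low-stakes feature you might prefer to fail open and keep serving answers when the flag store is down. Either way, decide it in calm conditions and write it in the runbook.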

Common failure modes and what to do about them

These are the ones that show up again and again. None are exotic. All are cheaper to prevent than to explain to a customer.

Failure mode	What it looks like	Mitigation
Confident hallucination	A fluent, wrong answer stated as fact	Ground responses in retrieved data, cite sources, add a verification check
Prompt injection	User input hijacks instructions or exfiltrates data	Separate instructions from data, validate tool calls, never trust model output as a command
Malformed output	Downstream code breaks on bad JSON	Structured outputs plus a validation layer that retries or falls back
Cost blowout	A loop or abusive user spikes the bill	Per-user limits, step ceilings, budget alerts
Latency spikes	Slow responses under load	Timeouts, streaming, a smaller fast model for simple cases
Silent quality drift	Answers get worse after a provider update	Continuous evals and alerts on quality, not just on errors
Stale or wrong context	The model answers from old data	Freshness checks on retrieval, cache invalidation, show the data's date
Privacy leak	Customer data lands in logs or training	Redact before logging, no-training agreements, retention limits
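
Two rows of the table, malformed output and cost blowout, share one mitigation pattern: retry a bounded number of times, then fall back. A minimal sketch, assuming hypothetical generate and is_valid callables and an arbitrary ceiling of three attempts:

```python
from typing import Callable, Tuple

# Hypothetical ceiling; the point is that one exists.
MAX_ATTEMPTS = 3


def call_with_ceiling(
    generate: Callable[[], str],
    is_valid: Callable[[str], bool],
    fallback: str,
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[str, int]:
    """Retry until the output validates, but never more than max_attempts times.

    Returns the text to use and the number of model calls that were made.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        output = generate()
        if is_valid(output):
            return output, attempt
    # Ceiling reached: stop spending and degrade to the safe default.
    return fallback, max_attempts
```

Returning the attempt count alongside the result makes it easy to log how often retries happen, which is an early signal of both quality drift and rising cost.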

How to use this

Do not treat the boxes as bureaucracy. Treat them as a conversation with your future self at 2am. Walk a real feature through each group with the engineers who built it and the person who will be on call for it. Where you cannot check a box, write down why, and decide out loud whether that gap is acceptable for this launch.

You will not check every box for every feature, and you should not pretend otherwise. A low-stakes internal summarizer needs less than a feature that emails your customers. The work is matching the rigor to the blast radius, and being honest about the difference. Ready is not perfect. Ready is safe to fail.