
The AI Readiness Checklist Before You Ship to Customers
A pre-launch checklist for AI features: evals, guardrails, cost controls, monitoring, fallbacks, and the failure modes that embarrass you in production.
Key takeaways
- A feature is ready when it fails safely and visibly, not when the model is always right; silent expensive failures mean it is not ready.
- Build an eval set of 30 to 100 real inputs with a written pass threshold that runs automatically on every prompt, model, and dependency change.
- Guardrails are code that runs after the model and can reject malformed output, not a prompt that says "be safe"; irreversible actions need a confirmation step.
- Log token cost per request tagged by feature and customer, set budget alerts, and cap retries and agent steps so one user cannot run up the bill.
What "ready" actually means
Ready does not mean the model is always right. It never will be. Ready means that when the model is wrong, your system catches it, contains it, or degrades into something harmless, and you find out before your customer does. If a feature can fail safely and you can see it failing, it is ready. If it can fail silently and expensively, it is not, no matter how good the demo looked.
This is the checklist a feature goes through before any real customer touches it. It is not about chasing perfection. It is about making the failure modes boring instead of scary. Run through each group below. If you cannot check a box, you have a decision to make: fix it, scope around it, or ship with eyes open and a written reason why.
Quality you can actually measure
You cannot improve what you only judge by vibes. "It looked good when I tried it" is a feeling, not a quality bar. Before launch you need a fixed set of inputs, a definition of correct for each one, and a number you refuse to ship below.
- An eval set of 30 to 100 real inputs exists, including the ugly edge cases your users will actually send
- Each input has a defined notion of correct (exact match, semantic closeness, or a graded rubric)
- A pass threshold is written down and agreed on, not invented after the run
- The eval runs automatically on every prompt change, model change, and dependency bump
- Someone other than the author has reviewed a sample of outputs by hand
The point of the eval set is not to prove the feature is perfect. It is to give you a trend line, so the next change is a measured step instead of a guess. If you want help standing this up across a portfolio of features, this is core to how we approach enterprise AI development.
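As a rough illustration of the bullets above, here is a minimal sketch of an eval gate in Python. Everything in it is an assumption made for the example: the names `EvalCase`, `run_eval`, and `gate`, the 0.90 threshold, and the exact-match grader. A real feature may need semantic or rubric grading instead.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical bar. The real number should be agreed on before the run.
PASS_THRESHOLD = 0.90


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Simplest notion of correct: same text, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    cases: Sequence[EvalCase],
    generate: Callable[[str], str],
    grade: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of cases whose output is graded as correct."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if grade(generate(case.prompt), case.expected))
    return passed / len(cases)


def gate(pass_rate: float, threshold: float = PASS_THRESHOLD) -> bool:
    """True if the change may ship, False if it falls below the written bar."""
    return pass_rate >= threshold
```

In CI, `generate` would wrap the real model call, and the job would fail whenever `gate(run_eval(cases, generate))` returns `False`. Keeping the threshold as a named constant in source control is one way to make sure it is not invented after the run.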
Guardrails and output constraints
The model will eventually produce something you did not plan for. Your job is to make sure that something cannot reach a customer unchecked, leak into a system that trusts it, or get executed as if it were safe.
- Outputs are constrained to a schema (structured output or tool calls) wherever the shape matters
- A validation layer rejects malformed or out-of-range responses before they are used
- Inputs are checked for prompt injection on any flow that touches tools, data, or other users
- The model cannot trigger irreversible actions (payments, deletes, emails) without a confirmation step
- There is a hard cap on output length and on the number of tool calls per request
- Sensitive topics or out-of-scope requests have a defined refusal behavior
A guardrail is not a vibe-check prompt that says "be safe." It is code that runs after the model and can say no.
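To make "code that runs after the model and can say no" concrete, here is a minimal sketch of a validation layer. The action names, the refund ceiling, and the `Decision` type are hypothetical and exist only for this example; the point is that malformed or out-of-range output is rejected, and irreversible actions are flagged for confirmation rather than executed.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary and limits for illustration only.
ALLOWED_ACTIONS = {"lookup_order", "refund"}
IRREVERSIBLE_ACTIONS = {"refund"}
MAX_REFUND_CENTS = 50_000


@dataclass(frozen=True)
class Decision:
    ok: bool
    reason: str
    needs_confirmation: bool = False


def check_output(raw: str) -> Decision:
    """Validate raw model output before anything downstream trusts it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Decision(False, "not valid JSON")
    if not isinstance(data, dict):
        return Decision(False, "expected a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return Decision(False, f"unknown action: {action!r}")

    if action == "refund":
        amount = data.get("amount_cents")
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(amount, int) and not isinstance(amount, bool)
        if not is_int or not 0 < amount <= MAX_REFUND_CENTS:
            return Decision(False, "refund amount missing or out of range")

    # Irreversible actions pass validation but still wait for a confirmation step.
    return Decision(True, "ok", needs_confirmation=action in IRREVERSIBLE_ACTIONS)
```

A caller would execute the action only when `ok` is true, and only after a person or a separate policy check approves it when `needs_confirmation` is set.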
A human fallback for low confidence
Most teams build the happy path and forget the off-ramp. A confident wrong answer does more damage than no answer, and the off-ramp is what keeps one from becoming a customer-facing mistake.
- Low-confidence cases route to a human or to a safe default instead of guessing
- There is a clear signal for "the model is unsure" (low score, failed validation, ambiguous input)
- The fallback path is tested, not theoretical, and someone owns the queue it feeds
- Customers can reach a human when the AI cannot help, and that path is obvious
If your feature has no fallback, it is not autonomous. It is just unsupervised.
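One way to picture the off-ramp is a small routing function that sits between the model and the customer. This is a sketch under assumptions: the 0.75 confidence floor, the `ModelResult` fields, and the `human_queue` destination are all made up for the example, and how you obtain a usable confidence score depends on your model and task.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical floor. Tune it against your eval set rather than guessing.
CONFIDENCE_FLOOR = 0.75


@dataclass(frozen=True)
class ModelResult:
    answer: str
    confidence: float
    passed_validation: bool


@dataclass(frozen=True)
class Routed:
    destination: str  # "customer" or "human_queue"
    answer: Optional[str]
    reason: str


def route(result: ModelResult) -> Routed:
    """Send unsure or invalid results to a human instead of guessing."""
    if not result.passed_validation:
        return Routed("human_queue", None, "failed validation")
    if result.confidence < CONFIDENCE_FLOOR:
        return Routed("human_queue", None, "low confidence")
    return Routed("customer", result.answer, "ok")
```

Recording the `reason` alongside each routed request also gives you the fallback rate for free, which is a useful number to watch after launch.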
Cost instrumentation and budget alerts
LLM features have a variable cost that scales with usage and, worse, with abuse. A retry loop or a single malicious user can turn a comfortable bill into a panicked Slack message. You want to know your unit economics before launch, not after the invoice.
- Token usage and cost are logged per request, tagged by feature and customer
- You know the cost per successful outcome, not just the cost per call
- A budget alert fires when daily or hourly spend crosses a threshold
- Per-user and per-tenant rate limits are in place so one account cannot run up the bill
- Runaway loops (retries, agent steps) have a hard ceiling
If you cannot answer "what does one customer interaction cost us," you are not ready to price it or scale it.
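The sketch below shows one possible shape for per-request cost logging, a budget check, and a hard ceiling on retries. The per-token prices and the daily budget are placeholder numbers, not any provider's real pricing, and `CostLedger` and `call_with_ceiling` are hypothetical names for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, TypeVar

T = TypeVar("T")

# Placeholder numbers. Substitute your provider's actual prices and your own budget.
PRICE_PER_1K_INPUT_USD = 0.003
PRICE_PER_1K_OUTPUT_USD = 0.015
DAILY_BUDGET_USD = 50.0
MAX_ATTEMPTS = 3


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call in USD, from token counts."""
    return (
        input_tokens / 1000 * PRICE_PER_1K_INPUT_USD
        + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_USD
    )


@dataclass
class CostLedger:
    # Spend keyed by (feature, customer) so it can be sliced either way.
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, feature: str, customer: str, input_tokens: int, output_tokens: int) -> float:
        cost = request_cost(input_tokens, output_tokens)
        self.spend[(feature, customer)] += cost
        return cost

    def total(self) -> float:
        return sum(self.spend.values())

    def over_budget(self, budget: float = DAILY_BUDGET_USD) -> bool:
        return self.total() >= budget


def call_with_ceiling(call: Callable[[], T], max_attempts: int = MAX_ATTEMPTS) -> T:
    """Retry a failing call, but never more than max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

In production the ledger would live in a metrics store rather than in memory, and `over_budget` would trigger an alert. Dividing the total by the number of successful outcomes, rather than by the number of calls, gives the cost-per-outcome figure from the checklist.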
Monitoring, tracing, and prompt versioning
When something goes wrong at 2am, you need to reconstruct what happened: the input, the prompt version, the model response, the tool calls, the final output. If that trail does not exist, every incident becomes an archaeology project.
- Every request is traced end to end (input, prompt version, model, tools, output)
- Latency, error rate, and fallback rate are on a dashboard someone watches
- Prompts are versioned in source control, and the running version is recorded with each request
- You can roll a prompt back as fast as you can roll back code
- There is an alert for quality regressions, not just for crashes (a working endpoint can still be giving bad answers)
- Personally identifiable data in logs is masked or excluded
The hardest production failures are the quiet ones, where nothing throws an error and the answers just slowly get worse. Watching for them is the part most teams skip, and it is the heart of what we mean by MLOps and AgentOps.
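A trace record does not need to be elaborate to be useful. The sketch below assumes a hypothetical `Trace` dataclass written as one JSON line per request, and it derives the prompt version from a content hash purely for illustration; a commit SHA or release tag from source control would serve the same purpose.

```python
import hashlib
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Sequence


def prompt_version(prompt_text: str) -> str:
    """Short, stable identifier for a prompt (assumption: a content hash is acceptable)."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass
class Trace:
    feature: str
    model: str
    prompt_version: str
    user_input: str  # mask or exclude personal data before storing this
    output: str
    tool_calls: list = field(default_factory=list)
    latency_ms: float = 0.0
    used_fallback: bool = False
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_log_line(trace: Trace) -> str:
    """Serialize one request as a single JSON log line."""
    return json.dumps(asdict(trace), sort_keys=True)


def fallback_rate(traces: Sequence[Trace]) -> float:
    """Share of requests that took the fallback path, for the dashboard."""
    if not traces:
        return 0.0
    return sum(t.used_fallback for t in traces) / len(traces)
```

With the prompt version stored on every record, "which prompt produced this answer" becomes a log query instead of an archaeology project.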
Data privacy and the legal review
Shipping AI to US customers means you are now making promises about where their data goes and what is done with it. Get this wrong and the cost is not a bug ticket, it is a breach notice.
- You know which provider sees customer data and whether it is used for training (it should not be)
- A data processing agreement is in place with each model provider
- Sensitive fields are redacted or tokenized before they leave your perimeter
- Retention of prompts and outputs is defined, and deletion actually works
- Your privacy policy and terms reflect that AI is in the loop
- If you handle regulated data (health, financial, or data about EU residents), the relevant review is signed off
This is the one group where "we will fix it later" is genuinely dangerous. Do it before launch.
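As a small illustration of redacting sensitive fields before they leave your perimeter, here is a sketch that masks two patterns with regular expressions. It is deliberately narrow: the patterns for email addresses and US Social Security numbers are assumptions chosen for the example, and pattern matching alone will miss plenty of real personal data, so treat it as a starting point rather than a complete control.

```python
import re

# Illustrative patterns only. Real redaction needs a broader, tested set.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matched sensitive values with placeholders before sending or logging."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = SSN_PATTERN.sub("[SSN]", text)
    return text
```

Calling `redact` at the boundary, just before the request is built for the provider and again before anything is written to logs, keeps the rule in one place instead of relying on every caller to remember it.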
A plan for when it is wrong
It will be wrong. The question is whether being wrong is a shrug or a crisis. Decide that now, in calm conditions, instead of improvising it during an incident.
- Customers can flag a bad response in one click, and those flags land somewhere you read
- There is a kill switch to disable the feature without a deploy
- You can roll back to the previous prompt or model in minutes
- An on-call owner is named, and the runbook says what to do for the common failures
- Bad outputs feed back into the eval set so the same mistake is caught next time
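A kill switch can be as simple as a flag that is read at request time rather than baked into the build. The sketch below assumes a hypothetical JSON flag file and a feature named `ai_summary`; in practice the flag would more likely live in a config service or feature-flag system, but the fail-closed behavior is the part worth copying.

```python
import json
from pathlib import Path
from typing import Callable

UNAVAILABLE_MESSAGE = "This feature is temporarily unavailable."


def feature_enabled(flag_file: Path, feature: str) -> bool:
    """Read the flag on every request so a change takes effect without a deploy."""
    try:
        flags = json.loads(flag_file.read_text())
    except (OSError, json.JSONDecodeError):
        # Fail closed: if the flag cannot be read, treat the feature as off.
        return False
    return isinstance(flags, dict) and flags.get(feature) is True


def handle(request_text: str, flag_file: Path, generate: Callable[[str], str]) -> str:
    """Serve the AI feature only while its flag is explicitly on."""
    if not feature_enabled(flag_file, "ai_summary"):
        return UNAVAILABLE_MESSAGE
    return generate(request_text)
```

Flipping `"ai_summary"` to `false` in the flag source disables the feature on the next request. Whether failing closed is right depends on the feature; for anything customer-facing it is usually the safer default.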
Common failure modes and what to do about them
These are the ones that show up again and again. None are exotic. All are cheaper to prevent than to explain to a customer.
| Failure mode | What it looks like | Mitigation |
|---|---|---|
| Confident hallucination | A fluent, wrong answer stated as fact | Ground responses in retrieved data, cite sources, add a verification check |
| Prompt injection | User input hijacks instructions or exfiltrates data | Separate instructions from data, validate tool calls, never trust model output as a command |
| Malformed output | Downstream code breaks on bad JSON | Structured outputs plus a validation layer that retries or falls back |
| Cost blowout | A loop or abusive user spikes the bill | Per-user limits, step ceilings, budget alerts |
| Latency spikes | Slow responses under load | Timeouts, streaming, a smaller fast model for simple cases |
| Silent quality drift | Answers get worse after a provider update | Continuous evals and alerts on quality, not just on errors |
| Stale or wrong context | The model answers from old data | Freshness checks on retrieval, cache invalidation, show the data's date |
| Privacy leak | Customer data lands in logs or training | Redact before logging, no-training agreements, retention limits |
How to use this
Do not treat the boxes as bureaucracy. Treat them as a conversation with your future self at 2am. Walk a real feature through each group with the engineers who built it and the person who will be on call for it. Where you cannot check a box, write down why, and decide out loud whether that gap is acceptable for this launch.
You will not check every box for every feature, and you should not pretend otherwise. A low-stakes internal summarizer needs less than a feature that emails your customers. The work is matching the rigor to the blast radius, and being honest about the difference. Ready is not perfect. Ready is safe to fail.