
How to Evaluate an AI Development Partner
A practical checklist for CTOs vetting an AI development agency, covering evals, IP, security, cost, and the questions that separate builders from demo shops.
Key takeaways
- Ask a vendor how they know the thing works, and listen for an eval suite, a measured pass rate, and a story about a failure they caught and fixed.
- Get clean IP ownership in writing for code, prompts, fine-tuned weights, and eval datasets, since vendors sometimes treat prompts and eval sets as their property.
- Demand total cost of ownership including recurring inference, because a low build quote that assumes a frontier model on every call hides an expensive operation.
- Run a small, fairly paid two-to-four-week trial with a real deliverable before signing, so you judge them on evidence rather than a sales call.
The fastest way to separate a real AI development partner from a demo shop is to ask one question: how do you know the thing works? If the answer is a slick video and a confident "it just does," walk away. If the answer involves an eval suite, a measured pass rate, and a story about the cases where it failed and what they changed, you are talking to people who ship. Everything else on this page is detail, but that one question carries most of the weight.
I run an AI development firm, so you should read this with the appropriate skepticism. I am not a neutral party. But I have also sat on the other side of the table as the buyer, and I have inherited enough half-finished AI projects from other vendors to know exactly where these engagements go wrong. What follows is the checklist I would use if I were the CTO doing the vetting.
Why AI vendors are harder to evaluate than normal dev shops
A regular web agency is easy to judge. You look at sites they built, you click around, the work either holds up or it doesn't. AI is different because the demo and the product live in completely separate worlds.
Anyone with an API key can build a demo that looks like magic. Wire up a model, cherry-pick three happy-path inputs, record the screen. That demo tells you almost nothing about whether the system holds up on the messy, adversarial, high-volume reality of production. The gap between "works in the demo" and "works for real users at 2am with weird input" is where most AI projects quietly die, and it is invisible during a sales call unless you know what to probe for.
So the evaluation is less about "can they make something impressive" and more about "do they have the discipline to make something that stays correct when nobody is watching." Those are different skills, and a lot of vendors only have the first one.
The seven things that actually matter
Here is the short version. The rest of the post unpacks each one.
- They measure quality with evals, not vibes
- The contract gives you clean ownership of all IP and model artifacts
- They have a real answer for security and data handling
- Senior engineers do the work, not just the sales call
- Cost is transparent, including the part that recurs forever (inference)
- They have shipped AI to production before, with references who will say so
- Communication and timezone overlap actually fit your team
If a vendor scores well on five of these but waves their hands at the other two, that is usually fine, provided those two are not evals or IP. If they fail the eval question or the IP question, nothing else matters.
Evals: how do they know it works?
This is the single highest-signal area, so spend the most time here.
A team that builds AI seriously has an eval harness. That means a dataset of representative inputs, a defined notion of "correct" for each (or a graded rubric for the fuzzy ones), and a way to run a candidate system against the whole set and get a number back. When they change a prompt or swap a model, they re-run the suite and see whether the number went up or down. Without this, every change is a guess, and "we improved it" is just a feeling.
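To make that concrete, here is a minimal sketch of what such a harness boils down to. Everything in it is illustrative: `EvalCase`, `run_suite`, and `candidate_system` are hypothetical names, the canned answers stand in for a real model call, and a real suite would have hundreds of cases with rubric-graded checks for the fuzzy ones.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One representative input plus a check that defines 'correct' for it."""
    name: str
    prompt: str
    is_correct: Callable[[str], bool]


def run_suite(
    system: Callable[[str], str], cases: list[EvalCase]
) -> tuple[float, list[str]]:
    """Run every case through the candidate system.

    Returns the pass rate and the names of the cases that failed.
    """
    failures = [c.name for c in cases if not c.is_correct(system(c.prompt))]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures


# Hypothetical stand-in for the real system (prompt plus model call).
def candidate_system(prompt: str) -> str:
    canned = {
        "Refund window for order 123?": "You can get a refund within 30 days.",
        "What is your CEO's home address?": "I can't share personal information.",
    }
    return canned.get(prompt, "I don't know.")


CASES = [
    EvalCase("refund_window", "Refund window for order 123?",
             lambda out: "30 days" in out),
    EvalCase("refuses_pii", "What is your CEO's home address?",
             lambda out: "can't share" in out),
    EvalCase("shipping_time", "How long does shipping take?",
             lambda out: "business days" in out),
]

pass_rate, failures = run_suite(candidate_system, CASES)
print(f"pass rate: {pass_rate:.0%}, failed: {failures}")
```

The point is the shape, not the code: the same cases run against every candidate, so a prompt tweak or model swap produces a number you can compare with the last one, plus a list of exactly which cases broke.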
Ask them to walk you through their eval setup on a past project. You are listening for specifics: how big was the dataset, how did they decide what counted as a pass, did they use an LLM-as-judge for the subjective cases and how did they keep that judge honest, what was the pass rate when they shipped, and how did they catch regressions. A vendor who does this will light up at the question because it is the part of the job they are proud of. A vendor who does not will give you a vague answer about "extensive testing."
The follow-up that catches people: ask about a time the evals caught something they would have shipped otherwise. Real practitioners have that story instantly. It is the AI equivalent of a good engineer telling you about a bug their tests caught.
IP ownership: read the contract before you fall in love
This one is boring and it is where companies get burned the hardest.
You want, in writing, that you own all of it: the code, the prompts, the fine-tuned weights if any, the eval datasets, and the training data derived from your inputs. Prompts and eval sets are real assets. A polished system prompt and a well-curated eval suite can be more valuable than the application code around them, and some vendors quietly treat those as their reusable property.
Watch for two specific traps. First, "we retain rights to our framework" can be reasonable if the framework is genuinely a general tool, or it can be a way to keep the actual brains of your product. Make them draw the line precisely. Second, check what happens to data you send to third-party model providers. If your customer data flows through an API, you need to know whether the provider can train on it, and the contract should forbid that path entirely for anything sensitive.
| Area | Green flag | Red flag |
|---|---|---|
| Code and prompts | You own everything, stated explicitly | "Standard agency terms," details unclear |
| Eval datasets | Handed over as a deliverable | Treated as vendor's proprietary asset |
| Fine-tuned models | Weights and configs are yours | "We host it for you" with no exit path |
| Third-party data flow | Zero-retention terms, named providers | Vague about where data goes |
If the contract is squishy on ownership, that is not a paperwork problem you can fix later. Fix it before signing or do not sign.
Security and data handling
The questions here are the same ones you would ask any vendor handling your data, plus a few that are specific to AI.
The AI-specific risks are worth naming. Prompt injection is real: if your system takes untrusted input and feeds it to a model that can call tools or access data, an attacker can try to hijack it. Ask how they think about that. Data leakage through model providers is the other big one, covered above. And if they are using retrieval over your documents, ask how access control works, because a naive RAG setup will happily surface a document to a user who should never have seen it.
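The retrieval point is easy to show in miniature. The sketch below assumes each document carries an `allowed_groups` field and that the caller's groups are known at query time. Both are assumptions for illustration, and keyword matching stands in for a real vector search. What matters is the order: the permission check runs before anything is handed to the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # assumed ACL metadata stored with each document


def retrieve(query: str, user_groups: set[str], corpus: list[Document]) -> list[Document]:
    """Return matching documents that the user is permitted to read.

    Filtering by permission happens first, so a restricted document
    never reaches the model's context and cannot leak into an answer.
    """
    terms = query.lower().split()
    visible = [d for d in corpus if d.allowed_groups & user_groups]
    return [d for d in visible if any(t in d.text.lower() for t in terms)]


CORPUS = [
    Document("handbook", "Vacation policy: 20 days per year.",
             frozenset({"staff", "hr"})),
    Document("salaries", "Salary bands and vacation payouts by level.",
             frozenset({"hr"})),
]

staff_hits = retrieve("vacation", {"staff"}, CORPUS)
hr_hits = retrieve("vacation", {"hr"}, CORPUS)
print([d.doc_id for d in staff_hits])  # only the handbook
print([d.doc_id for d in hr_hits])     # handbook and salaries
```

A naive setup skips the `visible` step and searches the whole corpus, which is exactly how the salary document ends up quoted to someone outside HR.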
Run through the basics too. Where does data live? Who can see it? What happens to it when the engagement ends? Do they hold a SOC 2 report, or are they willing to work under your security requirements? If you are in a regulated space, this is not optional and the vendor should already be fluent in it. This is the kind of judgment a good AI consulting engagement should surface early, before a line of production code exists.
Senior versus junior staffing
The classic agency move is to sell you the principal engineer and staff the project with juniors the moment the ink dries. In AI this hurts more than usual, because the field is young and the depth of experience varies wildly. Someone who has shipped three production AI systems and watched them fail in interesting ways is worth a great deal more than someone who finished an LLM course last quarter.
Ask who specifically will do the work, by name, and ask to talk to them, not just the salesperson. Ask how long they have worked with this technology. Ask what they think is overhyped right now, because anyone with real experience has strong, specific opinions about what does not work, and the demo-makers only have enthusiasm.
If you are staffing an internal team rather than buying a project outcome, the same logic applies to dedicated AI developers: you want to interview the actual humans and see their work, not accept a generic "senior engineer" slot on a rate card.
Cost transparency, including the part that never stops
AI has a cost structure most software does not: the bill keeps coming after the build is done. Every inference call costs money, and at scale that recurring cost can dwarf the development fee. A vendor who only talks about the build price and goes quiet on running costs is either inexperienced or hiding the ball.
A good partner models the ongoing cost for you up front. They estimate tokens per request, requests per day, the model tier you actually need, and they tell you roughly what your monthly inference bill looks like at your expected volume. They also tell you how to bring it down: caching, a smaller model for the easy cases, batching, the obvious levers. If they have not thought about this, they have not run anything at scale.
Watch for the trap of a low build quote that assumes an expensive frontier model for every call forever. The cheap project becomes the expensive operation. Get the total cost of ownership in writing, not just the invoice for the build.
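The arithmetic behind that estimate is simple enough to sketch. The prices and volumes below are made-up placeholders, not any provider's real rates, and `monthly_inference_cost` is a hypothetical helper. Swap in your own numbers.

```python
def monthly_inference_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    days: int = 30,
) -> float:
    """Estimated monthly spend in USD for one model at a steady request volume."""
    per_request = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return per_request * requests_per_day * days


# Placeholder workload: 50,000 requests a day, 1,500 tokens in, 300 tokens out.
INPUT_TOKENS, OUTPUT_TOKENS = 1_500, 300

# Scenario 1: an expensive frontier model on every call (placeholder prices).
frontier_only = monthly_inference_cost(50_000, INPUT_TOKENS, OUTPUT_TOKENS, 10.0, 30.0)

# Scenario 2: assume 80% of requests are easy enough for a smaller model.
small_share = monthly_inference_cost(40_000, INPUT_TOKENS, OUTPUT_TOKENS, 0.5, 1.5)
frontier_share = monthly_inference_cost(10_000, INPUT_TOKENS, OUTPUT_TOKENS, 10.0, 30.0)
routed = small_share + frontier_share

print(f"frontier on every call: ${frontier_only:,.0f}/month")
print(f"routed by difficulty:   ${routed:,.0f}/month")
```

With these placeholder figures, the frontier-everywhere setup comes to about $36,000 a month and the routed one to about $8,640. The exact numbers are invented. The gap between the two scenarios is what a vendor should be showing you before you sign.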
Can they actually ship to production?
Production is a different animal from a demo, and the difference is mostly the unglamorous parts.
Ask what their AI systems do when the model API is down, when it returns garbage, when a request times out, when a user sends something adversarial. Ask about monitoring: how do they know in production whether quality is degrading, since a model can get quietly worse without throwing a single error. Ask about cost controls, rate limits, and what happens when usage spikes. These questions sort the builders from the demo-makers faster than almost anything else, because handling them is tedious work that only matters once real users show up.
| Capability | Demo-maker | Production builder |
|---|---|---|
| Failure handling | "It works" | Fallbacks, retries, graceful degradation |
| Monitoring | None | Tracks quality and cost in production |
| Eval discipline | Manual spot checks | Automated suite, run on every change |
| Cost awareness | Build price only | Models ongoing inference cost |
| Adversarial input | Not considered | Tested and defended |
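As a rough sketch of what "fallbacks, retries, graceful degradation" means in code, here is one way to wrap a model call. The function names, the choice of exceptions, and the canned apology are all assumptions for illustration. A real system would also log each failure and enforce timeouts at the HTTP client level.

```python
import time
from typing import Callable


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> str:
    """Try the primary model with retries, then a fallback, then degrade gracefully."""
    for attempt in range(max_attempts):
        try:
            answer = primary(prompt)
            if is_valid(answer):  # reject garbage output, not just exceptions
                return answer
        except (TimeoutError, ConnectionError):
            pass  # transient failure: fall through and retry
        if attempt < max_attempts - 1:
            time.sleep(base_delay_s * 2**attempt)  # exponential backoff
    try:
        answer = fallback(prompt)
        if is_valid(answer):
            return answer
    except (TimeoutError, ConnectionError):
        pass
    # Last resort: an honest message instead of an error page or a bad answer.
    return "Sorry, I can't answer that right now. A human will follow up."


# Simulated outage of the primary model, for demonstration only.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def backup_model(prompt: str) -> str:
    return f"(backup) answer to: {prompt}"


result = call_with_fallback(
    "Where is my order?",
    flaky_primary,
    backup_model,
    is_valid=lambda s: bool(s.strip()),
    base_delay_s=0.0,  # no waiting in the demo
)
print(result)
```

None of this is clever. It is the tedious plumbing that a demo never needs, and that a production system cannot run without.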
References, and how to use them
Always take references, and do not let the vendor hand you only the cheerleaders. Ask for a reference where something went wrong. Every real project has one, and how the vendor handled trouble tells you more than ten happy stories.
On the call, get specific. Did the system actually make it to production and stay there, or did it quietly get shelved after the press release? Did costs land where they were quoted? How were the engineers when things got hard? And would they hire the vendor again for the next project? That last one is the only reference question that really counts.
Communication and timezone
This is mundane and it kills more engagements than technical failure does. Decide what overlap your team actually needs. A few hours of real-time overlap per day is usually enough for a well-run async relationship, but you have to be honest about whether your team can work async at all. Some can. Some need someone online when they are.
Ask how they communicate by default, how often you will see working software (the answer should be measured in days, not months), and who your single point of contact is when something breaks. The cadence of demos matters: if they cannot show you running software every week or two, you have no way to course-correct until it is too late to course-correct cheaply.
Run a paid trial before you commit
Here is the part most people skip, and it is the best protection you have. Do not sign a six-month contract off a sales call. Run a small, paid first engagement, two to four weeks, with a real deliverable, and judge them on that.
Pick a slice of the real problem, not a toy. It should be small enough to finish quickly but real enough that doing it well requires the skills the big project needs. The deliverable should run, not just demo: a working endpoint, an eval report with actual numbers, and a short writeup of what they would do differently at full scale.
Pay them properly for it. A cheap trial attracts vendors who treat it as a loss leader and staff it accordingly. A fairly paid trial gets you their real team and their real work, which is exactly what you are trying to evaluate.
When the trial ends, you have what no sales process can give you: evidence. You have seen their code, their eval numbers, how they communicate under a deadline, and whether the people who showed up are the people they sold you. If the trial went well, scale up with confidence. If it did not, you spent a few weeks and a modest budget to dodge a six-month mistake, which is the cheapest insurance in this entire process.