
Estimating Timeline and Budget for an AI Build
Why AI projects blow past estimates, and a framework for scoping timeline and budget that accounts for evals, data work, and the iteration AI actually needs.
Key takeaways
- AI estimates slip because quality is iterative, data is messier than admitted, evals take real time, and the last 10 percent of accuracy is harder than the first 90.
- Discovery and hardening together are often half the total effort, and they are the two phases teams are most tempted to cut from optimistic estimates.
- The accuracy bar is the single biggest cost lever, since wrong almost never can cost two or three times more than helpful most of the time.
- Treat an exact fixed price quoted before anyone looks at the data as a guess, and ask for a range with the assumptions written next to it.
If someone hands you a fixed timeline and a fixed price for an AI build before any data has been looked at, they are guessing. They might be guessing well, but it is still a guess. The honest answer to "how long and how much" is a range that gets tighter as you learn things you cannot know up front: how messy your data actually is, how good "good enough" has to be, and how many iterations it takes to close the gap between a demo that impresses people and a system you can put in front of customers.
This post is the framework I use to scope that range. It is built for a CTO or founder who has a budget number in their head and wants to know whether it is realistic. No code to speak of, just where the time goes and what moves the number.
Why AI estimates slip
Traditional software estimation works because the work is mostly knowable. You can decompose a CRUD app into screens and endpoints, and a senior engineer can size each one within reason. The variance is real but bounded.
AI builds break that assumption in four specific ways.
Quality is iterative, not built once. With a normal feature, you write it, it works, you move on. With an AI feature, you build a first version, measure how often it is right, and then spend most of your effort dragging that number up. The first version is rarely the deliverable. It is the baseline you improve from.
Data is messier than anyone admits. Almost every estimate I have seen go wrong went wrong here. The data exists, sure, but it is in three formats across two systems, half of it is unlabeled, and the labels that do exist were written by people who disagreed with each other. You do not find this out from a kickoff call. You find it out a week in, when someone actually opens the files.
Evals take real time to build, and you cannot skip them. An eval is the test suite for an AI system: a set of cases with known-good answers that tells you whether a change made things better or worse. Without it you are flying blind, tuning prompts on vibes. Building a trustworthy eval set is itself a chunk of the project, and it is the chunk people most want to cut to save time. Cutting it does not save time. It just moves the cost to later, when you are debugging a regression in production with no way to reproduce it.
The last 10 percent of accuracy is harder than the first 90. Getting a model from nothing to 80 percent correct can feel fast. Going from 80 to 95 is where the weeks go, because the remaining failures are the weird ones: the edge cases, the ambiguous inputs, the rare categories. Each point of accuracy past a certain line costs more than the one before it. If your use case genuinely needs 99 percent, that is a different project than one that is fine at 85, and the difference is mostly time.
The framework: four phases
I break every AI build into four phases. The point of separating them is that they have different risk profiles and different ways of going wrong, so you want to budget and re-estimate at each boundary rather than committing to the whole thing at once.
Discovery and evals
This is where you look at the real data, define what "working" means in numbers, and build the eval set that will tell you whether you are getting there. You also build a quick throwaway version of the core capability, mostly to find out what is hard. People underinvest here because it produces no shippable feature, but it is the phase that de-risks everything after it. A good discovery phase is what lets you give a tighter estimate for the rest.
Build
The main construction work: pipelines, the model integration or fine-tuning, the application around it, the plumbing. This is the part that looks most like normal software, and it is the part that estimates the most reliably. If discovery was done properly, the build phase holds few surprises.
Hardening
Making it survive contact with reality. Latency, cost per request, error handling for when the model returns nonsense, guardrails, monitoring, the fallback path for when the AI is not confident. A prototype skips almost all of this. A production system lives or dies on it. This phase is the single biggest difference between a demo and a product, and it is the one most often left out of optimistic estimates.
Iteration
After it is live, you watch real usage, find the failure patterns you did not anticipate, and improve against them. This does not really end, but the intense part, the push from "launched" to "actually good," is a budgetable block of work. Treat it as part of the project, not as a vague afterthought.
Where the effort goes
Here is roughly how the effort splits across a typical production build. Shares are of total engineering effort, not calendar time, and they shift with how clean your data is and how high the accuracy bar sits.
| Phase | What happens | Typical share of effort |
|---|---|---|
| Discovery and evals | Inspect real data, define success metrics, build the eval set, throwaway prototype of the core capability | 20-30% |
| Build | Pipelines, model integration or tuning, the surrounding application | 25-35% |
| Hardening | Latency, cost, guardrails, fallbacks, monitoring, edge-case handling | 20-30% |
| Iteration | Measure live behavior, fix real failure patterns, push accuracy to the target | 15-25% |
A few things worth noticing. Discovery and hardening together are often half the work, and they are the two phases people are most tempted to cut. The build phase, the part that feels like "the project," is rarely the majority. And the shares move: dirty data inflates discovery, a strict accuracy requirement inflates iteration, and a high-traffic or regulated use case inflates hardening.
Prototype timeline versus production timeline
These are different projects, and conflating them is how budgets get set wrong.
A prototype answers one question: can this work at all? It runs on a clean slice of data, skips most hardening, and is allowed to fail on edge cases because nobody depends on it yet. You are buying a yes-or-no answer and an informed estimate for the real thing. For a well-defined problem, a prototype is usually weeks, not months. If you want to validate the idea and the path before committing real budget, that is exactly what our AI MVP development work is for: a fast, honest read on feasibility.
A production system is everything the prototype skipped, plus the iteration to get accuracy where it has to be. Realistic data, the full hardening phase, monitoring, and the push from "it works in the demo" to "it works for the customer who is trying to break it." This is months, and the spread is wide because it depends almost entirely on the accuracy bar and the data.
The trap is quoting a production timeline using prototype assumptions. The prototype hit 85 percent on clean data in three weeks, so surely production is, what, double that? No. The production version has to handle the messy 15 percent, survive load, and not embarrass you when the model hallucinates. That is where the months come from.
What actually moves the number
When I give a range, these are the variables I am pricing in. If you want to make your build cheaper and faster, this is the list to negotiate against.
- The accuracy bar. The single biggest lever. "Helpful most of the time" and "wrong almost never" are different budgets, sometimes by a factor of two or three. Be honest about what your use case truly requires, because aiming higher than you need is one of the most expensive mistakes available.
- Data readiness. Clean, labeled, accessible data can cut the discovery phase in half. Scattered, unlabeled, or locked-up data can double it. Find out which one you have before you commit to anything.
- How well-defined the task is. "Classify these support tickets into ten known categories" is tractable. "Make our product smarter with AI" is not a spec, and the time spent turning it into one is part of the cost.
- Tolerance for being wrong. A system where a mistake is mildly annoying needs far less hardening than one where a mistake loses money or breaks trust. The stakes of a wrong answer set the size of the guardrail work.
- Integration surface. A standalone tool is faster than something wired into five existing systems, each with its own auth, data format, and team you have to coordinate with.
If you are early enough that these answers are still fuzzy, that fuzziness is the finding. Resolving it is the first work to do, and it is cheaper to do deliberately than to discover it mid-build. That kind of upfront scoping is most of what AI consulting is good for: turning a vague ambition into a problem with a number attached.
How to read any estimate you are given
So when a vendor or your own team hands you a figure, here is how to sanity-check it.
Ask what accuracy target it assumes, and what happens to the number if the target moves. Ask whether anyone has looked at the actual data yet, or whether the estimate assumes the data is clean. Ask where evals and hardening sit in the plan, because if they are missing, the estimate is for a prototype wearing a production label. And ask for a range with the assumptions written next to it, not a single number, because a single number is either padded heavily or about to be wrong.
A good estimate is not a promise of precision. It is an honest map of what is known, what is not, and what the unknowns could cost. The team that gives you a range and tells you what would tighten it is more trustworthy than the one that gives you an exact figure with a confident face. I would budget for the range, plan to re-estimate after discovery, and treat anyone quoting an exact number up front with friendly suspicion.