Insights // Cost | 2026-05-01 | 10 min read

What It Really Costs to Run an LLM Feature in Production

Token bills are only part of it. The true cost of an LLM feature: inference, retrieval infra, evals, monitoring, and the engineering time nobody budgets for.

Varun Raj Manoharan, Founder & Principal Engineer
Cost, LLM, Production, CTO

Key takeaways

  • Budgeting an LLM feature off the per-token price alone underestimates the real cost by a factor of three to ten.
  • Engineering time usually dominates the bill at 40 to 70 percent, since someone has to build the feature and keep it working as prompts drift and models get deprecated.
  • Budget the unit, like one resolved conversation, then multiply by an honest volume forecast instead of guessing a lump sum.
  • Prompt caching can cut input cost by roughly 90 percent on the stable portion of a prompt, which is the highest-leverage optimization for high-traffic features.

If you are budgeting an LLM feature off the per-token price on the provider's pricing page, your number is wrong by a factor of three to ten. The tokens are real, but they are usually the smallest line item once the thing is actually in front of users. The bills that surprise people are the vector database that has to stay up, the eval suite someone has to write and keep current, the observability stack you bolt on after the first bad weekend, and the engineering salaries that quietly dwarf all of it.

This post is the costing framework I wish someone had handed me before my first production LLM project. It is not a tutorial. It is a way to build a defensible monthly number, with a worked example for a customer support assistant, plus a back-of-envelope formula you can run in your head during a planning meeting.

Start with the unit, not the total

The mistake I see most often is budgeting the feature as a lump sum. "How much will the AI thing cost?" has no answer. "How much does one request cost, and how many requests do we expect?" has an answer you can defend.

So the first move is to define the unit. For a support assistant, the unit is one resolved conversation. For a summarizer, it is one document. For a code reviewer, it is one pull request. Everything downstream gets cheaper to reason about once you have picked the unit, because inference cost scales with it, and most of the other costs do not.

That split matters. Inference is variable cost: it grows linearly with usage. Almost everything else (retrieval infra, evals, monitoring, the team) is closer to fixed cost: you pay it whether you serve a thousand requests or a million. The whole game of LLM economics is figuring out where your usage volume sits relative to that fixed base.
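
To make the split concrete, here is a minimal Python sketch of how the cost of one unit falls as volume spreads the fixed base. Both constants are assumptions backed out of the worked example later in this post, and the guardrail spend is lumped into the fixed base for simplicity; substitute your own figures.

```python
# Assumed figures, backed out of the worked example later in this post.
# Fixed base: vector DB + embedding + evals + monitoring + guardrails + maintenance.
FIXED_MONTHLY_USD = 4_280.0
# Variable part: inference only, $2,835 spread over 50,000 conversations.
VARIABLE_USD_PER_CONVERSATION = 0.0567


def cost_per_conversation(conversations_per_month: int) -> float:
    """Fully loaded cost of one conversation at a given monthly volume."""
    fixed_share = FIXED_MONTHLY_USD / conversations_per_month
    return fixed_share + VARIABLE_USD_PER_CONVERSATION


for volume in (1_000, 50_000, 1_000_000):
    unit_cost = cost_per_conversation(volume)
    print(f"{volume:>9,} conversations/month -> ${unit_cost:.4f} each")
```

Under these assumptions a conversation costs about $4.34 at 1,000 a month, about $0.14 at 50,000, and about $0.06 at a million. The inference part never changes; only the fixed share does.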

The cost categories

Here are the six categories I budget for. The first scales per request. The rest mostly do not.

Category | Scales with | Easy to forget? | Typical share of monthly cost
Per-request inference | Usage (linear) | No | 15-40%
Retrieval and vector infra | Data size, modest with usage | Somewhat | 5-15%
Evals and QA | Releases, not traffic | Yes | 5-10%
Monitoring and observability | Traffic, sub-linear | Yes | 3-8%
Guardrails and safety | Usage (linear, small) | Yes | 2-8%
Engineering time | Nothing. It is just there. | Constantly | 40-70%

The shares are deliberately wide ranges, because they swing hard with scale. At low volume, engineering time is almost the entire bill and inference is a rounding error. At high volume, inference can become the thing your CFO asks about. The point of the table is not the exact percentages, it is the shape: the line item everyone obsesses over (tokens) is rarely the one that decides the budget.

1. Per-request inference

This is the token bill, and it is the one part of the framework where the public pricing actually applies. You pay for input tokens (the prompt, the system instructions, the retrieved context, the conversation history) and output tokens (what the model generates). Output tokens usually cost several times more than input tokens, so a chatty model is expensive in a way that is easy to miss.

The number people forget here is context. A naive chat feature sends maybe 500 input tokens. A RAG feature that stuffs five retrieved chunks plus a long system prompt plus three turns of history can send 8,000 input tokens per call, and you pay for all of it on every single request. The model did not get more expensive. Your prompt did.
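
A quick sketch of that difference in dollars. The $3 per million input tokens is an assumption borrowed from the worked example further down, and caching is ignored here.

```python
INPUT_USD_PER_MILLION = 3.00  # assumed list price, same as the worked example


def input_cost_usd(input_tokens: int,
                   price_per_million: float = INPUT_USD_PER_MILLION) -> float:
    """Input-side cost of a single call, ignoring caching."""
    return input_tokens / 1_000_000 * price_per_million


naive_chat = input_cost_usd(500)    # short prompt, no retrieval
rag_call = input_cost_usd(8_000)    # chunks + system prompt + history
print(f"naive chat: ${naive_chat:.4f} per call")
print(f"RAG call:   ${rag_call:.4f} per call ({rag_call / naive_chat:.0f}x)")
```

Same model, same price sheet, sixteen times the input bill per call.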

A few things move this line item more than the model choice does:

  • Retrieved context length. Every chunk you inject is billed on every call.
  • Conversation history. If you replay the full transcript each turn, cost grows with conversation length.
  • Output verbosity. Capping max output tokens and prompting for brevity is a real lever.
  • Caching. If a large chunk of your prompt is stable (a fixed system prompt, a reference document), prompt caching can cut the input cost of that portion by roughly 90% on repeat requests. For high-traffic features with a big shared prefix, this is the single highest-leverage optimization available.
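
To see what the caching lever is worth, a minimal sketch of the effective input price when part of the prompt is read from cache. The function name and parameters are illustrative, and the 90% discount is the rough figure quoted above, not any specific provider's rate.

```python
def blended_input_price(list_price: float,
                        cached_fraction: float,
                        cache_discount: float = 0.90) -> float:
    """Effective price per million input tokens with partial cache hits.

    cache_discount=0.90 mirrors the "roughly 90%" figure in the text. Real
    discounts, cache-write surcharges, and expiry rules vary by provider
    and are ignored in this sketch.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_price = list_price * (1.0 - cache_discount)
    return cached_fraction * cached_price + (1.0 - cached_fraction) * list_price


for fraction in (0.0, 0.5, 0.9):
    price = blended_input_price(3.00, fraction)
    print(f"{fraction:.0%} of input cached -> ${price:.2f} per million tokens")
```

The saving is proportional to how much of the prompt is actually stable, so it pays to put the fixed material at the front and the per-request material at the end.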

2. Retrieval and vector infrastructure

If your feature does retrieval, you are running infrastructure that has to stay up regardless of traffic: a managed vector database, the embedding calls to index your corpus, and the re-indexing jobs when content changes. For a moderate corpus this might be a few hundred dollars a month, flat. It barely moves with request volume, which is exactly why budgeting it per request gives you a misleading number.

Embedding the corpus is a one-time-ish cost (you re-embed when content changes or when you switch embedding models), and it is usually small. The recurring cost is keeping the index hosted and queryable. Worth saying plainly: if you are deciding whether RAG is even the right pattern, that is a design question with its own tradeoffs, and I have written about generative AI development approaches that weigh retrieval against alternatives like longer context windows or fine-tuning. The cost framework here assumes you have already decided retrieval is worth it.

3. Evals and QA

This is the category teams skip, and then pay for in incidents. An LLM feature has no compile-time guarantees. The only way you know a prompt change did not break something is to run it against a test set and grade the outputs. Building that eval suite is engineering work, and running it costs inference tokens (often on a more expensive model acting as a judge).

The cost here is mostly time: someone builds a representative test set, writes the grading logic, and keeps both current as the feature evolves. The recurring compute is the eval runs themselves, which fire on every meaningful prompt or model change rather than on user traffic. Budget it as a fixed monthly cost plus a chunk of one engineer's time. Skipping it does not save money, it defers the cost to your on-call rotation.
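
For a sense of how small the starting point can be, here is a hypothetical eval loop. Every name in it is made up for illustration, the substring check stands in for whatever grading logic you actually use (including a model-as-judge), and `fake_assistant` replaces the real model call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    must_mention: str  # phrase a correct answer is expected to contain


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose answer contains the expected phrase."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(
        1 for case in cases
        if case.must_mention.lower() in generate(case.question).lower()
    )
    return passed / len(cases)


def fake_assistant(question: str) -> str:
    """Stand-in for the real model call; always gives the same answer."""
    return "You can reset your password from the Settings page."


CASES = [
    EvalCase("How do I reset my password?", "settings"),
    EvalCase("Where do I download invoices?", "billing"),
]

pass_rate = run_evals(fake_assistant, CASES)
print(f"pass rate: {pass_rate:.0%}")
```

The loop is the cheap part. The test set and the grading rules are where the engineering time goes.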

4. Monitoring and observability

You cannot operate what you cannot see. For an LLM feature that means logging full prompts and responses, tracking latency and token usage per request, and capturing user feedback signals (thumbs up/down, escalations, retries). Some of this is a SaaS observability bill. Some of it is storage for the logs, which adds up faster than you expect because prompts and responses are large text blobs.

The cost scales with traffic but sub-linearly, since you typically sample rather than store everything at full fidelity. The forgotten cost is the storage and the dashboards, not the tooling license. Plan for it, because the first time the model starts behaving oddly in production, this is the only thing standing between you and guessing.
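
One way to sample, sketched under the assumption that every request carries a stable ID. The function name and the 10% rate are illustrative. Lightweight fields such as latency and token counts would typically still be recorded for every request, with only the large text blobs sampled.

```python
import hashlib


def keep_full_log(request_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether to store the full prompt and response.

    Hashing the request ID instead of calling random() means the same request
    gets the same decision everywhere, including on retries.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < sample_rate


kept = sum(keep_full_log(f"req-{i}", sample_rate=0.10) for i in range(10_000))
print(f"kept {kept} of 10,000 full logs")
```

A refinement worth considering is to always keep requests that ended in a thumbs-down or an escalation, since those are the ones you will want to read.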

5. Guardrails and safety

If your feature touches user-generated input or produces user-facing output, you probably need some checks: prompt-injection filtering, PII detection, content moderation, output validation. Some of these are extra model calls (a classifier pass before or after the main call), so they add a small per-request cost. Some are libraries you run yourself, which is engineering time rather than tokens.

For most B2B features this is a small line item. For consumer-facing or regulated products it can be substantial, both in per-call classifier costs and in the engineering effort to get it right. Either way, budget something. Zero is the wrong number.

6. Engineering time

Here is the line item that dominates and that almost nobody puts in the spreadsheet. Someone has to build this feature, and then someone has to keep it working. Prompts drift, models get deprecated, retrieval quality degrades as the corpus grows, and edge cases surface in production that no one anticipated.

A rough split I have seen hold up: building a production-grade LLM feature is a few engineer-months. Maintaining it is something like 10-20% of an engineer's time on an ongoing basis, more if it is central to your product. At a loaded cost of a senior engineer, that maintenance alone often exceeds the entire inference bill at moderate scale. If your budget does not have a person in it, your budget is fiction.

A worked example

Let me put real numbers on a customer support assistant. Assume a B2B SaaS company whose assistant handles tier-one questions over a documentation corpus using RAG and serves 50,000 conversations a month. Each conversation averages three turns. Per turn we send roughly 6,000 input tokens (system prompt, four retrieved chunks, history) and generate roughly 600 output tokens.

That is 150,000 turns a month. Input: 150,000 turns times 6,000 tokens equals 900M input tokens. Output: 150,000 times 600 equals 90M output tokens.

I will price inference at $3 per million input tokens and $15 per million output tokens, which is in the range of a mid-tier capable model in 2026. I will also assume prompt caching cuts the effective input cost roughly in half, because the system prompt and much of the retrieved context repeat within a conversation.
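
The inference arithmetic under those assumptions, as a short script. The blended input price is my reading of "roughly in half": half of the input tokens billed as cache hits at 10% of list price, the other half at full price.

```python
CONVERSATIONS_PER_MONTH = 50_000
TURNS_PER_CONVERSATION = 3
INPUT_TOKENS_PER_TURN = 6_000
OUTPUT_TOKENS_PER_TURN = 600

INPUT_USD_PER_M = 3.00
OUTPUT_USD_PER_M = 15.00
# Assumption: 50% of input tokens are cache hits billed at 10% of list price.
CACHED_FRACTION = 0.5
CACHE_PRICE_MULTIPLIER = 0.10

turns = CONVERSATIONS_PER_MONTH * TURNS_PER_CONVERSATION
input_tokens_m = turns * INPUT_TOKENS_PER_TURN / 1_000_000    # millions of tokens
output_tokens_m = turns * OUTPUT_TOKENS_PER_TURN / 1_000_000  # millions of tokens

blended_input_usd_per_m = (
    CACHED_FRACTION * INPUT_USD_PER_M * CACHE_PRICE_MULTIPLIER
    + (1 - CACHED_FRACTION) * INPUT_USD_PER_M
)

input_cost = input_tokens_m * blended_input_usd_per_m
output_cost = output_tokens_m * OUTPUT_USD_PER_M

print(f"turns/month:   {turns:,}")
print(f"input tokens:  {input_tokens_m:,.0f}M at ${blended_input_usd_per_m:.2f}/M = ${input_cost:,.0f}")
print(f"output tokens: {output_tokens_m:,.0f}M at ${OUTPUT_USD_PER_M:.2f}/M = ${output_cost:,.0f}")
```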

Line item | Basis | Monthly cost
Inference, input tokens | 900M tokens, ~50% cached, blended ~$1.65/M | $1,485
Inference, output tokens | 90M tokens at $15/M | $1,350
Vector database (managed) | Flat, moderate corpus | $400
Embedding and re-indexing | Periodic, small corpus | $80
Eval runs | Per release, model-as-judge | $250
Monitoring and observability | SaaS plus log storage | $350
Guardrails (classifier pass) | ~150K extra small calls | $200
Engineering maintenance | ~15% of one senior engineer | $3,000
Total | | ~$7,115/month

Look at where the money is. Inference is about $2,835, which is 40% of the bill. The maintenance engineer is $3,000, which is 42%. The whole supporting cast of infra, evals, monitoring, and guardrails is about $1,280, which is the other 18%. If you had budgeted this feature off the token price alone, you would have planned for roughly $2,800 and covered only about 40% of the real bill, an underestimate of roughly 2.5x.
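
If you want to check the split or rerun it with your own line items, a small script is enough. The dollar figures are the ones from the table; the dictionary keys and the grouping are just labels for this sketch.

```python
LINE_ITEMS_USD = {
    "inference_input": 1_485,
    "inference_output": 1_350,
    "vector_db": 400,
    "embedding_reindexing": 80,
    "eval_runs": 250,
    "monitoring": 350,
    "guardrails": 200,
    "engineering_maintenance": 3_000,
}

GROUPS = {
    "inference": ("inference_input", "inference_output"),
    "maintenance": ("engineering_maintenance",),
    "supporting": ("vector_db", "embedding_reindexing", "eval_runs",
                   "monitoring", "guardrails"),
}

total = sum(LINE_ITEMS_USD.values())
# Share of the monthly total taken by each group of line items.
shares = {
    group: sum(LINE_ITEMS_USD[key] for key in keys) / total
    for group, keys in GROUPS.items()
}

print(f"total: ${total:,}/month")
for group, share in shares.items():
    print(f"{group:<12} {share:.0%}")
```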

And the maintenance figure here is conservative. I used 15% of one engineer. If the assistant is core to the product and gets active iteration, that number climbs, and it climbs faster than the token bill does.

The back-of-envelope formula

When someone asks "roughly what will this cost" in a meeting, here is what I run mentally:

Monthly cost ≈ (requests × cost-per-request) + fixed infra + (maintenance fraction × loaded engineer cost)

Where:

  • cost-per-request = (input tokens × input price) + (output tokens × output price), adjusted down for caching if you have a stable prefix.
  • fixed infra = vector DB + monitoring + eval compute. For most features this sits somewhere between a few hundred and a couple thousand dollars a month, fairly flat.
  • maintenance fraction = 0.1 to 0.3 of an engineer, depending on how central and how volatile the feature is.
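
The same formula as a function, for when the meeting is over and you want to plug in numbers. Parameter names are mine. The sample call uses the worked example's inputs, with the $20,000 monthly loaded engineer cost being the figure implied by $3,000 for 15% of one engineer.

```python
def monthly_cost_usd(requests: int,
                     input_tokens: int,
                     output_tokens: int,
                     input_usd_per_m: float,
                     output_usd_per_m: float,
                     fixed_infra_usd: float,
                     maintenance_fraction: float,
                     loaded_engineer_usd_per_month: float) -> float:
    """Back-of-envelope monthly cost of an LLM feature.

    input_usd_per_m should already be adjusted down for caching if a stable
    prefix exists; this function does not model caching itself.
    """
    cost_per_request = (
        input_tokens * input_usd_per_m + output_tokens * output_usd_per_m
    ) / 1_000_000
    inference = requests * cost_per_request
    people = maintenance_fraction * loaded_engineer_usd_per_month
    return inference + fixed_infra_usd + people


estimate = monthly_cost_usd(
    requests=150_000,            # turns per month in the worked example
    input_tokens=6_000,
    output_tokens=600,
    input_usd_per_m=1.65,        # blended, after caching
    output_usd_per_m=15.00,
    fixed_infra_usd=1_280,       # vector DB, embedding, evals, monitoring, guardrails
    maintenance_fraction=0.15,
    loaded_engineer_usd_per_month=20_000,  # implied by the example, an assumption
)
print(f"estimated monthly cost: ${estimate:,.0f}")
```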

The useful instinct this formula builds is knowing which term dominates at your scale. Below a few hundred thousand requests a month, the engineer term usually wins and the token math matters less. Above a few million, inference starts to rival or exceed it, and caching plus output-token discipline become worth real engineering attention. These thresholds move with prompt size: in the worked example above, at roughly 6,000 input tokens per turn, inference already comes close to the engineer term at 150,000 turns a month. Knowing which regime you are in tells you where to spend your optimization effort, and where not to bother.
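
To find your own crossover instead of relying on a rule of thumb, divide the engineer term by the cost per request. A sketch using the worked example's inputs; the function name is illustrative.

```python
def break_even_requests(cost_per_request_usd: float,
                        engineer_term_usd: float) -> float:
    """Monthly request volume at which inference spend equals the engineer term."""
    if cost_per_request_usd <= 0:
        raise ValueError("cost_per_request_usd must be positive")
    return engineer_term_usd / cost_per_request_usd


# Worked-example inputs: 6,000 input tokens at a blended $1.65/M plus
# 600 output tokens at $15/M, against $3,000/month of maintenance.
cost_per_request = (6_000 * 1.65 + 600 * 15.00) / 1_000_000
volume = break_even_requests(cost_per_request, engineer_term_usd=3_000)

print(f"cost per request: ${cost_per_request:.4f}")
print(f"break-even at about {volume:,.0f} requests/month")
```

With those inputs the two terms meet at roughly 159,000 requests a month. Heavy prompts pull the crossover in; light prompts and a larger maintenance share push it out.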

What to take into the budget meeting

Three things, if nothing else stuck.

First, budget the unit, then multiply. A per-request number times an honest volume forecast is defensible. A lump sum is a guess.

Second, the costs that get forgotten (evals, monitoring, guardrails, and above all engineering time) are not optional extras. They are the difference between a demo and a feature that survives contact with real users. Leaving them out of the budget does not make them go away, it just moves them to a quarter where you were not expecting them.

Third, put a person in the spreadsheet. The model is rarely the expensive part. The people who keep the model useful are. If your LLM budget has no salary line, redo it before you take it to anyone who controls money.