Insights // Cost · 2026-05-08 · 9 min read

Cutting LLM Cost 50% Without Wrecking Quality

A practical playbook for bringing down LLM spend: model routing, prompt caching, context discipline, and batching, with the quality tradeoffs spelled out.

Varun Raj Manoharan, Founder & Principal Engineer
Cost · LLM · Optimization · CTO

Key takeaways

  • Stand up an eval with a quality score you trust before changing any setting, since cost is easy to measure but quality stays invisible until customers complain.
  • Work the cheap, low-risk levers first: prompt caching saves 30 to 70 percent on cached input, and capping output length saves 10 to 25 percent.
  • Model routing offers the biggest dollar savings of 30 to 60 percent, but needs a fallback that retries low-confidence answers on the larger model.
  • Self-hosting usually costs more than the API until you have very high, steady volume and a team that already runs production infrastructure.

Overview

You can usually cut an LLM bill in half without your users noticing, as long as you go after the right levers in the right order and you measure quality the whole way down. The waste is rarely the price per token. It's that you're sending your most expensive model a flood of easy requests, resending the same 4,000-token system prompt every call, and stuffing the context window with retrieved chunks the model never reads.

Here is the order I'd work through, ranked by what it pays back versus what it costs you to set up. Cheap wins first.

I'll be honest about the part most cost guides skip: some of these cuts aren't worth it. Self-hosting is the obvious trap. And every cut on this list can quietly degrade output if you ship it without an eval to catch the regression. The whole playbook only works if you measure.

The one rule before you cut anything

Set up an eval before you touch a single setting. You need a fixed set of representative inputs (50 to 200 is plenty to start) and a way to score the outputs, whether that's an exact-match check, a rubric scored by a stronger model, or a human spot-check on a sample.

Without this, you are flying blind. Cost is trivial to measure (the bill goes down) and quality is invisible until a customer complains. So you cut, the dashboard looks great, and three weeks later support tickets climb and nobody connects it to the model swap you made. An eval turns "I think it's still fine" into a number you can defend in a review.

If you only do one thing from this post, do this one. The rest of the playbook assumes you have a quality number you trust. This is the kind of measurement-and-rollback discipline we build into MLOps and AgentOps work, because cost optimization without it is just guessing with extra steps.
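Here is a minimal sketch of what that eval can look like, assuming exact-match scoring. The eval set and `fake_model` are made up for illustration; in practice you'd use inputs sampled from your own traffic and a real model call in place of the stub.

```python
from typing import Callable

# Hypothetical eval set: (input, expected answer) pairs.
EVAL_SET = [
    ("Is 17 a prime number? Answer yes or no.", "yes"),
    ("What is the capital of France? One word.", "paris"),
    ("Is 21 a prime number? Answer yes or no.", "no"),
]


def exact_match_score(model: Callable[[str], str], eval_set=EVAL_SET) -> float:
    """Return the fraction of eval inputs the model answers exactly right."""
    correct = 0
    for prompt, expected in eval_set:
        answer = model(prompt).strip().lower()
        if answer == expected:
            correct += 1
    return correct / len(eval_set)


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    canned = {"17": "Yes", "France": "Paris", "21": "Yes"}  # last one is wrong on purpose
    for key, reply in canned.items():
        if key in prompt:
            return reply
    return ""


score = exact_match_score(fake_model)
print(f"quality score: {score:.2f}")  # quality score: 0.67
```

Run it once before any change to get a baseline, then rerun it after every change. A rubric or human-graded score can slot in behind the same interface as long as it returns one number.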

The levers, ranked

Here's the whole playbook on one screen. Numbers are rough and depend heavily on your traffic mix, so treat them as where-to-start guidance, not promises.

Lever                                     | Effort        | Typical savings               | Quality risk
Prompt caching                            | Low           | 30 to 70% on cached input     | Very low
Shorter outputs (cap max tokens)          | Low           | 10 to 25%                     | Low to medium
Model routing (easy work to cheap models) | Medium        | 30 to 60%                     | Medium
Context trimming (retrieval discipline)   | Medium        | 20 to 40% on input            | Low to medium
Batching (async jobs)                     | Low to medium | Up to 50% on eligible traffic | None (offline only)
Self-hosting                              | High          | Negative until high volume    | High (you now own reliability)

Start at the top. Caching and output caps are nearly free and barely move quality, so there's no reason not to do them first. Routing is where the biggest savings and the real quality risk both live. Self-hosting sits last because it carries the most risk and usually costs you money instead of saving it.

Lever 1: Prompt caching

This is the highest payoff per hour of work, full stop. If you send a large, stable prefix on most requests (a system prompt, a tool schema, a long set of instructions, a fixed document), the provider can cache it and charge you a fraction for the repeated part.

The mechanics are simple. Put everything stable at the front of the prompt and everything variable at the end. The cache matches on the prefix, so order matters.

Python
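# Stable prefix first (cacheable), variable input last.
# Placeholder values for illustration; use your real prompt, tool schema, and input.
SYSTEM_PROMPT = "You are a support assistant."
TOOL_DEFINITIONS = "Tools: lookup_order(order_id), refund(order_id)"
user_question = "Where is my order?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},        # stable, cacheable
    {"role": "system", "content": TOOL_DEFINITIONS},     # stable, cacheable
    {"role": "user", "content": user_question},          # varies per request
]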

The catch worth knowing: cache writes can cost slightly more than a normal input token, and caches expire (often in minutes). So caching only pays when the same prefix gets reused inside that window. For a chat product with steady traffic, that's almost always. For a job that runs once an hour, the cache is usually cold every time and you get nothing.
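To make the reuse window concrete, here is a rough cost sketch. The multipliers (1.25x for a cache write, 0.10x for a cache read) and the price of $3 per million input tokens are assumptions chosen for illustration, not any provider's actual rates; plug in your own.

```python
def prefix_cost(
    prefix_tokens: int,
    calls_in_window: int,
    price_per_token: float,
    write_multiplier: float = 1.25,  # assumed premium for writing the cache
    read_multiplier: float = 0.10,   # assumed discount for reading it
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) for the stable prefix in one cache window.

    The first call in the window writes the cache; the rest read from it.
    """
    if calls_in_window < 1:
        raise ValueError("calls_in_window must be at least 1")
    base = prefix_tokens * price_per_token
    uncached = base * calls_in_window
    cached = base * (write_multiplier + read_multiplier * (calls_in_window - 1))
    return uncached, cached


PRICE = 3 / 1_000_000  # assumed: $3 per million input tokens

cold = prefix_cost(4_000, calls_in_window=1, price_per_token=PRICE)
warm = prefix_cost(4_000, calls_in_window=20, price_per_token=PRICE)
print(f"1 call:   uncached ${cold[0]:.4f}, cached ${cold[1]:.4f}")
print(f"20 calls: uncached ${warm[0]:.4f}, cached ${warm[1]:.4f}")
```

Under these assumptions a single call per window costs more with caching than without, while twenty calls cut the prefix cost by roughly 84 percent.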

Quality risk here is close to zero. You're sending the exact same tokens, just paying less for them. This is the rare free lunch.

Lever 2: Shorter outputs

Output tokens usually cost several times more than input tokens, so a chatty model is expensive in a way that's easy to miss. Two moves help.

First, cap max_tokens to something realistic for the task. If you need a yes/no plus a sentence, do not leave the ceiling at 4,000. Second, ask for the format you actually want. "Answer in one paragraph" or "return only the JSON" cuts the rambling preamble that adds nothing.

The risk: cap too aggressively and you get truncated answers, which is worse than an expensive complete one. Watch your eval for outputs that hit the ceiling. If they cluster at the limit, the model wanted more room and you cut it off.
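One way to watch for that clustering is to log the output token count of every eval response and measure how many sit at the cap. This is a sketch with invented numbers. If your provider reports a stop reason on each response, that is a more direct signal than token counts.

```python
def ceiling_hit_rate(output_token_counts: list[int], max_tokens: int) -> float:
    """Return the fraction of responses that used the entire output budget."""
    if not output_token_counts:
        return 0.0
    hits = sum(1 for n in output_token_counts if n >= max_tokens)
    return hits / len(output_token_counts)


# Hypothetical token counts from an eval run with max_tokens=150.
counts = [42, 118, 150, 97, 150, 150, 61, 150]
rate = ceiling_hit_rate(counts, max_tokens=150)
print(f"{rate:.0%} of outputs hit the cap")  # 50% of outputs hit the cap
```

A rate near zero means the cap has headroom. A rate like the one above means half the answers were probably cut short and the cap should go up.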

Lever 3: Model routing

This is the lever with the biggest pure-dollar upside, and the one most teams skip because it feels like more engineering than it is. The idea: most requests are easy, but you're paying flagship prices for all of them. Send the easy ones to a smaller, cheaper model and reserve the expensive one for the hard ones.

A simple router goes a long way. You don't need an ML classifier to start; rules and heuristics catch most of the value.

Python
def route(request):
    # Cheap model for short, simple, well-bounded asks
    if request.token_count < 500 and not request.needs_reasoning:
        return "small-cheap-model"
    # Flagship for long context, multi-step reasoning, high-stakes output
    return "large-capable-model"

The savings depend entirely on your traffic mix. If 70% of your requests are simple lookups and classifications, routing those to a model that costs a fifth as much is a huge chunk of the bill gone.
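The arithmetic behind that example, as a small sketch. It assumes cheap and flagship requests use similar token counts, which is optimistic: simple requests tend to be shorter, so they usually make up less of the bill than of the request count.

```python
def blended_cost_ratio(cheap_share: float, cheap_price_ratio: float) -> float:
    """Return the bill after routing as a fraction of the all-flagship bill.

    cheap_share: fraction of spend that moves to the cheap model.
    cheap_price_ratio: cheap model price divided by flagship price.
    """
    return cheap_share * cheap_price_ratio + (1.0 - cheap_share)


ratio = blended_cost_ratio(cheap_share=0.70, cheap_price_ratio=0.20)
print(f"bill after routing: {ratio:.0%} of original")  # bill after routing: 44% of original
```

So in the best case, routing 70% of the work to a model at a fifth of the price leaves you paying 44% of the original bill.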

Now the honest part. This is where quality risk gets real. The cheap model will be worse at the hard cases, and if your router sends a hard case down the cheap path, the user gets a bad answer. Two safeguards. Run both models against your eval so you know exactly where the small one falls short, and build a fallback: if the cheap model returns low confidence or fails a validation check, retry on the big one. The retry costs you, but it's rare, and it keeps the floor from dropping out.
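A minimal sketch of the fallback safeguard, assuming the validation check is "the output parses as a JSON object with a `label` key". The check, the stub models, and the function names are all hypothetical; the shape is what matters.

```python
import json
from typing import Callable


def is_valid_label(raw: str) -> bool:
    """Hypothetical validation: output must be a JSON object with a 'label' key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "label" in parsed


def answer_with_fallback(
    prompt: str,
    cheap_model: Callable[[str], str],
    large_model: Callable[[str], str],
    is_valid: Callable[[str], bool] = is_valid_label,
) -> tuple[str, str]:
    """Try the cheap model; retry on the large one if validation fails.

    Returns (answer, name_of_model_that_answered).
    """
    draft = cheap_model(prompt)
    if is_valid(draft):
        return draft, "cheap"
    return large_model(prompt), "large"


# Stand-ins for real model calls so the sketch runs offline.
def cheap_stub(prompt: str) -> str:
    if "invoice" in prompt:
        return '{"label": "billing"}'
    return "I'm not sure, possibly a legal question?"  # fails validation


def large_stub(prompt: str) -> str:
    return '{"label": "legal"}'


easy = answer_with_fallback("Where is my invoice?", cheap_stub, large_stub)
hard = answer_with_fallback("Can I sublicense this under GDPR?", cheap_stub, large_stub)
print(easy)  # ('{"label": "billing"}', 'cheap')
print(hard)  # ('{"label": "legal"}', 'large')
```

Log which model answered each request. The share of "large" answers on the cheap path is your retry rate, and it tells you whether the router's rules need tightening.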

There's also a tempting cousin of routing: a "cascade" where you always try the cheap model first and escalate on failure. It can work, but you pay for two calls on every escalation, so it only beats plain routing when your cheap model succeeds most of the time. Measure before you assume.
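"Most of the time" can be pinned down with a simple cost model. The sketch below assumes a router that never misroutes (so it never pays for a retry) and flat per-request costs, both simplifications; the cost figures are invented.

```python
def cascade_cost(cheap_success_rate: float, cheap_cost: float, large_cost: float) -> float:
    """Expected cost per request when every request tries the cheap model first."""
    return cheap_cost + (1.0 - cheap_success_rate) * large_cost


def cascade_break_even(cheap_share: float, cheap_cost: float, large_cost: float) -> float:
    """Return the cheap-model success rate above which a cascade is cheaper than
    a router sending `cheap_share` of requests to the cheap model."""
    return cheap_share + (1.0 - cheap_share) * cheap_cost / large_cost


# Assumed costs: cheap model 1 unit per request, flagship 5 units.
threshold = cascade_break_even(cheap_share=0.70, cheap_cost=1.0, large_cost=5.0)
print(f"cascade wins above {threshold:.0%} cheap-model success")  # 76%
print(f"cascade at 90% success: {cascade_cost(0.90, 1.0, 5.0):.2f} units")  # 1.50
print(f"cascade at 60% success: {cascade_cost(0.60, 1.0, 5.0):.2f} units")  # 3.00
```

With these numbers the router costs 2.2 units per request, so the cascade only comes out ahead if the cheap model handles more than about three quarters of all traffic on its own.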

Lever 4: Trim the context (retrieval discipline)

If you're doing RAG, your input bill is probably bloated. The common pattern is to retrieve the top 20 chunks and paste them all in, on the theory that more context is safer. It isn't. You pay for every one of those tokens, and past a point extra chunks hurt quality because the model loses the relevant passage in the noise.

The fix is retrieval discipline. Retrieve fewer, better chunks. Add a reranking step so the top 3 are actually the top 3, then send 3 to 5 instead of 20. You cut input tokens hard and accuracy often goes up, not down, because the signal-to-noise ratio improves.
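The rerank-then-trim step can be sketched like this. The `overlap_score` function is a toy stand-in that counts shared words; it is not a real reranker, and in practice you would pass in a proper relevance model through the same `scorer` parameter.

```python
from typing import Callable


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def rerank_and_trim(
    query: str,
    chunks: list[str],
    scorer: Callable[[str, str], float] = overlap_score,
    keep: int = 3,
) -> list[str]:
    """Sort retrieved chunks by relevance and keep only the best `keep`."""
    ranked = sorted(chunks, key=lambda c: scorer(query, c), reverse=True)
    return ranked[:keep]


retrieved = [
    "Billing runs on the first day of each month",
    "To reset my account password open settings and choose security",
    "A password must be at least twelve characters long",
    "Contact support to close my account",
    "Our office is closed on public holidays",
]
context = rerank_and_trim("reset my account password", retrieved, keep=3)
for chunk in context:
    print(chunk)
```

The two chunks with no bearing on the question never reach the prompt, so you stop paying for them and the model has less noise to wade through.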

The risk is real on the other side: trim too far and you drop the chunk that held the answer. So this is another lever you tune against an eval, watching retrieval recall, not just cost. The sweet spot is the smallest context that holds your accuracy steady.
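Retrieval recall is cheap to measure if your eval set records which chunk holds the answer to each question. A sketch, with invented chunk IDs and rankings:

```python
def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of eval questions whose answer-bearing chunk is in the top k."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Hypothetical retrieval results for four eval questions, best match first.
ranked_ids = [
    ["c7", "c2", "c9", "c4", "c1"],
    ["c3", "c8", "c5", "c6", "c2"],
    ["c1", "c4", "c6", "c9", "c3"],
    ["c2", "c5", "c8", "c7", "c1"],
]
gold_ids = ["c7", "c5", "c9", "c2"]  # the chunk that actually holds each answer

for k in (5, 3, 1):
    print(f"recall@{k}: {recall_at_k(ranked_ids, gold_ids, k):.2f}")
# recall@5: 1.00
# recall@3: 0.75
# recall@1: 0.50
```

Here trimming from 5 chunks to 3 already loses the answer for one question in four, which says the ranking needs work before the context can shrink that far.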

Lever 5: Batching

If you have work that doesn't need an answer right now (overnight enrichment, bulk classification, generating embeddings for a backfill), batch APIs typically run around half price in exchange for a slower, asynchronous turnaround.

There's no quality tradeoff at all. It's the same model and the same tokens. The only constraint is latency, so this applies to background jobs, never to anything a user is waiting on. The mistake is leaving high-volume offline work on the real-time endpoint out of habit. If a job can tolerate a delay measured in minutes or hours, it belongs in a batch.
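A quick audit of your workloads can surface the jobs that belong in a batch. This sketch assumes each job can state how long its caller is willing to wait, and it uses a 24-hour worst-case batch turnaround as a placeholder; check the window your provider actually commits to. The job names are invented.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    max_wait_seconds: int  # how long the caller can tolerate waiting


def split_by_latency(
    jobs: list[Job], batch_turnaround_seconds: int = 24 * 3600
) -> tuple[list[Job], list[Job]]:
    """Return (realtime_jobs, batch_jobs) based on how long each can wait."""
    realtime = [j for j in jobs if j.max_wait_seconds < batch_turnaround_seconds]
    batch = [j for j in jobs if j.max_wait_seconds >= batch_turnaround_seconds]
    return realtime, batch


jobs = [
    Job("chat reply", max_wait_seconds=5),
    Job("nightly ticket classification", max_wait_seconds=36 * 3600),
    Job("catalog enrichment backfill", max_wait_seconds=7 * 24 * 3600),
    Job("search autocomplete", max_wait_seconds=1),
]
realtime, batch = split_by_latency(jobs)
print("realtime:", [j.name for j in realtime])
print("batch:   ", [j.name for j in batch])
```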

Lever 6: Self-hosting (usually don't)

This is the one people reach for first and should reach for last. Running an open-weights model on your own GPUs can look dramatically cheaper on a spreadsheet, because you're comparing a per-token API price against raw compute. That spreadsheet lies.

It leaves out the GPU instances sitting idle between bursts, the engineer time to run inference infrastructure, the work to hit the quality of a frontier hosted model, and the reliability you now own at 2am. For most teams, self-hosting costs more than the API until you're at serious, steady volume, and even then only if you have the team to operate it well.

There's a narrow case where it pays: very high and predictable volume, a task a smaller open model handles fine, and a team that already runs production infrastructure. If you can't check all three boxes, the API is cheaper once you count everything. I've watched more than one team spend six figures of engineering time to "save" on inference and come out behind.
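A back-of-the-envelope break-even makes the volume requirement concrete. Every figure below is an assumption for illustration, and the model is deliberately crude: it treats self-hosting as a fixed monthly cost and ignores the quality gap entirely.

```python
def break_even_tokens_per_month(
    gpu_cost_per_month: float,
    ops_cost_per_month: float,
    api_price_per_million_tokens: float,
) -> float:
    """Monthly token volume at which fixed self-hosting costs equal the API bill."""
    fixed_cost = gpu_cost_per_month + ops_cost_per_month
    return fixed_cost / api_price_per_million_tokens * 1_000_000


tokens = break_even_tokens_per_month(
    gpu_cost_per_month=6_000,          # assumed: reserved GPU instances
    ops_cost_per_month=15_000,         # assumed: share of engineer time to run them
    api_price_per_million_tokens=2.0,  # assumed blended API price
)
print(f"break-even: {tokens / 1e9:.1f} billion tokens per month")  # 10.5 billion
```

Below that volume the API wins on cost alone. Note how the engineer time, the line the spreadsheet usually leaves out, accounts for most of the fixed cost.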

A sane order of operations

Put together, here's how I'd actually run it.

  1. Stand up an eval with a quality score you trust.
  2. Turn on prompt caching and cap output lengths. Nearly free, low risk.
  3. Move eligible offline work to batch.
  4. Trim retrieval context, tuning against the eval.
  5. Add model routing with a fallback for low-confidence cases. This is your biggest lever, so spend the most care here.
  6. Only consider self-hosting if your volume is high and steady enough to justify owning the infrastructure.

After each step, check the eval. If quality holds, keep the change. If it slips, you know exactly which lever did it and you can dial it back. That's the whole trick: cost and quality move together, and the only way to cut one without sacrificing the other is to watch both at the same time.
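That keep-or-roll-back decision is simple enough to write down as a gate. A sketch, assuming a single quality score between 0 and 1 and a tolerance you pick up front; the 0.01 default and the example figures are placeholders.

```python
def keep_change(
    baseline_quality: float,
    new_quality: float,
    baseline_cost: float,
    new_cost: float,
    max_quality_drop: float = 0.01,  # assumed tolerance; set your own
) -> bool:
    """Keep a change only if it is cheaper and quality stays within tolerance."""
    quality_ok = (baseline_quality - new_quality) <= max_quality_drop
    cheaper = new_cost < baseline_cost
    return quality_ok and cheaper


# Hypothetical results: output caps held quality, an aggressive router did not.
print(keep_change(0.92, 0.915, baseline_cost=100, new_cost=70))  # True
print(keep_change(0.92, 0.88, baseline_cost=100, new_cost=50))   # False
```

Apply one lever at a time and run the gate after each, so a failed check points at exactly one change.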

What I'd leave on the table

Not every cut is worth making. Squeezing the last 5% out of a prompt to shave a few tokens per call rarely pays for the engineering hours, and it makes the prompt brittle. Quantizing a self-hosted model down to where quality wobbles to save on GPU memory is a false economy if it costs you customers. And switching providers purely on sticker price, without re-running your eval on the new model, is how you trade a known quantity for a surprise.

The goal was never the cheapest possible bill. It's the cheapest bill that still does the job. Get the eval in place, work the levers in order, and stop when the next cut starts costing you more in quality than it saves in dollars.