
Should We Self-Host an LLM? A Cost and Control Framework
Self-hosting an open-weight model can save money or quietly cost more. A CTO's framework for deciding based on volume, privacy, control, and the true total cost.
Key takeaways
- For most teams, hosted APIs win until you hit very high request volume or have data that legally or contractually cannot leave your control.
- A single GPU for a mid-size open-weight model costs roughly $1,100 to $2,200 per month when run around the clock, before any ops cost.
- Self-hosting only starts to win at high volume, around 2 billion tokens a month, where hosted bills reach tens of thousands of dollars.
- The cost teams underestimate most is people: a senior infra or ML engineer's time can dwarf the GPU savings you were chasing.
For most teams, the answer is no, at least not yet. Hosted APIs win on cost and effort until one of two things forces your hand: you hit a request volume where per-token pricing stops making sense, or you have data that legally or contractually cannot leave your control. If neither of those is true for you today, self-hosting is usually a distraction dressed up as a cost-saving measure.
That said, "most teams" is not "your team." This post is the framework I use with CTOs who are weighing the two. It is not a setup guide (we have one of those). It is about the decision itself, and how to make it without getting talked into a GPU fleet you do not need.
The four real decision drivers
Strip away the hype and there are four things that actually move this decision. Everything else is detail.
- Request volume, which sets your break-even against hosted pricing.
- Data privacy and residency, which can override cost entirely.
- Control over the model and your uptime, which matters more at scale than people expect.
- The hidden cost of running it yourself, which is the part teams consistently underestimate.
Let me take them one at a time.
Volume: the only number that makes self-hosting cheaper
Hosted APIs charge per token. Self-hosting charges per GPU-hour, whether or not you send it any traffic. So the math is simple in shape: hosted is variable cost, self-hosted is mostly fixed cost. Below some volume, you are paying for idle GPUs. Above it, you are paying a flat rate while the hosted bill keeps climbing.
The crossover point is your break-even. The trap is that people compute it using only the GPU rental price and forget the second half of self-hosting cost (the people who keep it running). I will get to that.
A rough way to think about it: a single GPU capable of serving a mid-size open-weight model costs somewhere in the range of $1.50 to $3.00 per hour on demand, which is roughly $1,100 to $2,200 per month if you keep it running around the clock. One GPU realistically serves a modest, steady stream of requests, not a spiky consumer load. If your traffic is bursty, you either over-provision (pay for peak all month) or build autoscaling (more engineering).
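The monthly figure is just the hourly rate multiplied by the hours in a month. Here is a minimal sketch of that arithmetic, assuming an average month of 730 hours (365 × 24 / 12). The function name is illustrative; the rates are the ones quoted above.

```python
HOURS_PER_MONTH = 365 * 24 / 12  # 730 hours in an average month (assumption)


def monthly_gpu_cost(hourly_rate: float, gpus: int = 1) -> float:
    """Cost of running `gpus` on-demand GPUs around the clock for one month."""
    return hourly_rate * HOURS_PER_MONTH * gpus


low = monthly_gpu_cost(1.50)   # low end of the on-demand range
high = monthly_gpu_cost(3.00)  # high end of the on-demand range
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Note that the `gpus` parameter multiplies the whole bill, so over-provisioning for peak traffic scales this number linearly.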
Privacy and residency: the override
This is the driver that ignores the cost math. If you are handling protected health information, certain financial records, defense-adjacent data, or you have a contract that says customer data stays inside a specific boundary, then the hosted-versus-self-hosted question may already be decided for you regardless of price.
Two clarifications worth making, because they change the answer:
- Many hosted providers now offer business tiers that do not train on your data and will sign a BAA or a data processing agreement. For a lot of "we can't send data to an AI vendor" worries, that is enough. Check before you assume you need to self-host.
- Residency is narrower than privacy. "Data must stay in the US" or "in the EU" is often satisfiable by picking the right region on a hosted provider. You do not always need to own the machine to control where the bytes live.
Self-hosting becomes genuinely necessary when the data cannot touch a third party at all, or when your compliance team will not accept a vendor's word and wants the model inside your own VPC. That is a real and common situation in regulated industries. It is just not as common as the people selling GPUs would like you to believe.
Control: model stability and your own uptime
Hosted models change under you. A provider can deprecate a version, adjust safety filters, or silently shift behavior, and your carefully tuned prompts drift overnight. If your product depends on a model behaving exactly the same way for years, self-hosting an open-weight model you pin and freeze gives you that stability. Nobody deprecates your weights but you.
The flip side is uptime. When a hosted API has an outage, it is the vendor's problem and usually covered by their SLA. When your self-hosted endpoint goes down at 2am, it is your on-call engineer's problem. You trade dependence on someone else's reliability for full ownership of your own. Whether that is an upgrade depends entirely on whether you have the team to carry it.
The hidden costs nobody puts in the spreadsheet
Here is where most break-even calculations fall apart. The GPU bill is the visible cost. These are the ones that show up later:
| Hidden cost | What it actually is |
|---|---|
| GPU provisioning | On-demand is expensive; reserved is cheaper but locks you in; spot is cheap but can vanish mid-request. |
| Inference engineering | Serving frameworks, batching, quantization, KV cache tuning. This is a specialty, not a weekend. |
| On-call and ops | Someone owns the pager. Monitoring, autoscaling, failover, capacity planning. |
| Model upgrades | Open weights move fast. Staying current means re-evaluating, re-testing, and re-tuning every few months. |
| Utilization waste | A GPU at 20% utilization still bills at 100%. Idle time is pure loss. |
The one I see ignored most often is people. A senior infra or ML engineer in the US is a six-figure salary. If self-hosting consumes even a meaningful slice of one person's time on an ongoing basis, that cost can dwarf the GPU savings you were chasing. This is exactly the gap where teams bring us in to build and operate the inference layer, which is part of our enterprise AI development work, so they get the cost profile of self-hosting without standing up a whole platform team to babysit it.
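To see how quickly people cost catches up with the GPU bill, here is a rough sketch. The $180,000 fully loaded annual cost and the 25% time slice are hypothetical placeholders, not figures from this post, so substitute your own.

```python
def monthly_people_cost(annual_loaded_cost: float, time_fraction: float) -> float:
    """Monthly cost of the share of one engineer's time spent on self-hosting."""
    if not 0.0 <= time_fraction <= 1.0:
        raise ValueError("time_fraction must be between 0 and 1")
    return annual_loaded_cost * time_fraction / 12


# Hypothetical inputs: $180k/year fully loaded, a quarter of their time.
ops_slice = monthly_people_cost(180_000, 0.25)
gpu_bill = 1_500  # one GPU for a month, within the range quoted earlier

print(f"People: ${ops_slice:,.0f}/month vs GPU: ${gpu_bill:,.0f}/month")
```

Under those assumptions, a quarter of one engineer costs more than twice the GPU it is looking after.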
A decision framework you can run in ten minutes
Walk these in order. Stop at the first one that gives you a clear answer.
Step 1. Is there a hard data constraint? If law or contract says the data cannot go to a third party, and a vendor BAA or DPA will not satisfy your compliance team, self-host. The cost math does not matter. Skip to planning.
Step 2. Is there a residency constraint only? If the requirement is just "data stays in region X," check whether your hosted provider offers that region first. If yes, stay hosted. If no acceptable region exists, self-hosting moves up the list.
Step 3. What is your steady-state volume? Estimate tokens per month at realistic load, not your launch-day dream. Run it through the break-even table below. If you are at or below the medium tier, stay hosted. The savings are not real once you add ops cost.
Step 4. Do you need model stability you cannot get hosted? If a deprecation would genuinely break your product and no hosted version-pinning option covers you, that is a point toward self-hosting. On its own it rarely justifies the switch, but combined with high volume it does.
Step 5. Do you have, or will you hire, the team? Be honest. If nobody on staff can own GPU ops and inference tuning, and you are not budgeting to hire or outsource it, you are not ready to self-host no matter what the token math says. Stay hosted until you are.
If you reach the end without a clear "self-host" signal, you have your answer. Hosted wins by default, and that is fine. It is the right call for the majority of teams.
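The five steps can be sketched as a small function. This is one possible reading of the framework, not a definitive encoding: the field names are made up for illustration, the 2 billion token threshold is borrowed from the "High" tier of the break-even table as a placeholder, and an unmet residency requirement is treated as a signal rather than an automatic verdict.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    hard_data_constraint: bool       # data cannot go to any third party
    vendor_agreement_suffices: bool  # a BAA or DPA satisfies compliance
    residency_required: bool         # data must stay in a given region
    hosted_region_available: bool    # hosted provider offers that region
    monthly_tokens: int              # steady-state estimate, not launch-day
    needs_pinned_model: bool         # a deprecation would break the product
    has_ops_team: bool               # staff or budget for GPU ops and tuning


HIGH_VOLUME_TOKENS = 2_000_000_000  # placeholder threshold (assumption)


def decide(s: Situation) -> tuple[str, list[str]]:
    # Step 1: a hard data constraint overrides the cost math entirely.
    if s.hard_data_constraint and not s.vendor_agreement_suffices:
        return "self-host", ["hard data constraint"]

    reasons: list[str] = []
    # Step 2: residency only pushes toward self-hosting if no region fits.
    if s.residency_required and not s.hosted_region_available:
        reasons.append("no acceptable hosted region")
    # Step 3: volume only counts once it reaches the high tier.
    high_volume = s.monthly_tokens >= HIGH_VOLUME_TOKENS
    if high_volume:
        reasons.append("high steady-state volume")
    # Step 4: stability is a supporting reason, never sufficient alone.
    if s.needs_pinned_model and high_volume:
        reasons.append("model stability")
    # Step 5: without a team, stay hosted whatever the signals say.
    if reasons and not s.has_ops_team:
        return "hosted", ["no team to run it yet"]
    if reasons:
        return "self-host", reasons
    return "hosted", ["no clear self-host signal"]


startup = Situation(False, True, False, False, 5_000_000, False, False)
print(decide(startup))
```

Returning the reasons alongside the verdict keeps the output honest: a "self-host" answer should always be traceable to a specific driver.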
Break-even: hosted vs self-hosted by volume
These numbers are illustrative, not a quote. They assume a mid-size open-weight model and US pricing as of early 2026. The point is the shape of the curve, not the exact dollars. Plug in your own provider rates before you decide.
| Monthly volume | Hosted API (per month) | Self-hosted, GPU only | Self-hosted, fully loaded | Winner |
|---|---|---|---|---|
| Low (~5M tokens) | ~$50 to $150 | ~$1,500 (one idle GPU) | ~$5,000+ (GPU + ops slice) | Hosted, by a mile |
| Medium (~150M tokens) | ~$1,500 to $4,500 | ~$1,500 to $3,000 | ~$6,000 to $9,000 | Hosted, once ops counts |
| High (~2B+ tokens) | ~$20,000 to $60,000 | ~$4,500 to $9,000 (a few GPUs) | ~$12,000 to $20,000 | Self-hosted starts to win |
A few things to read out of this table:
- At low volume it is not close. Hosted is cheaper than the electricity, let alone the engineer.
- At medium volume, the GPU-only column looks competitive, and that is the column that fools people. Add the fully loaded cost (ops, on-call, upgrades) and hosted still wins for most teams.
- At high volume the picture flips. When your hosted bill is in the tens of thousands per month, a few well-utilized GPUs plus a real ops budget can come out ahead, and the gap widens as you scale.
The crossover sits somewhere between the medium and high tiers for most workloads, and it moves with your utilization. Spiky traffic pushes the break-even higher (you pay for peak but use the average). Steady, predictable traffic pulls it lower (your GPUs stay busy).
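A rough sketch of the break-even arithmetic, under simplifying assumptions: self-hosted cost is treated as fixed, the hosted rate is about $10 per million tokens (the low end implied by the table's hosted column), and the GPU and ops figures are taken from the medium row. The `peak_to_average` parameter is a hypothetical knob for how much you over-provision to cover traffic spikes.

```python
def self_hosted_monthly(gpu_cost: float, ops_cost: float,
                        peak_to_average: float = 1.0) -> float:
    """Fixed monthly cost; GPUs are sized for peak, ops cost is not scaled."""
    return gpu_cost * peak_to_average + ops_cost


def break_even_millions(fixed_monthly: float, price_per_million: float) -> float:
    """Monthly volume (millions of tokens) where hosted spend equals fixed cost."""
    return fixed_monthly / price_per_million


PRICE_PER_MILLION = 10.0  # illustrative hosted rate, USD per million tokens

# Steady traffic: GPUs sized for the average load.
steady = break_even_millions(self_hosted_monthly(1_500, 4_500), PRICE_PER_MILLION)
# Spiky traffic: GPUs sized for a peak three times the average (assumption).
spiky = break_even_millions(
    self_hosted_monthly(1_500, 4_500, peak_to_average=3.0), PRICE_PER_MILLION
)

print(f"Steady: ~{steady:,.0f}M tokens/month, spiky: ~{spiky:,.0f}M tokens/month")
```

Both results land between the medium and high tiers, and the spiky case needs noticeably more volume before self-hosting pays off. In practice the GPU count also grows with volume, so treat this as a first pass rather than a forecast.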
So what should most CTOs actually do
Start hosted. Almost always. It is cheaper at the volumes most products run at, it has no ops burden, and it lets you ship while you learn what your real traffic looks like. Use a provider tier that signs a DPA so you are not boxed in by privacy worries you could have avoided.
Then watch two signals. The first is your hosted bill crossing into five figures a month with steady, predictable load. That is your cue to model self-hosting seriously. The second is a privacy or residency requirement you cannot satisfy with a hosted region or contract. That is your cue to self-host regardless of cost.
Until one of those fires, self-hosting is optimizing a cost that is not yet your biggest problem. When one of them does fire, do the full math, the loaded one with people in it, and make the call with eyes open. That is the whole framework. The hard part was never the GPUs. It was being honest about volume and headcount before the spreadsheet talked you into something.