Tutorial // Infrastructure | 2026-06-18 | 12 min read

Self-Host an Open-Weight Coding LLM with vLLM

Open-weight coding models are good enough to self-host now. A practical guide to serving one with vLLM and calling it through an OpenAI-compatible API.

Varun Raj Manoharan, Founder & Principal Engineer

vLLM, Open Source, Self-Hosting, LLM, Tutorial

Key takeaways

VRAM is the constraint that decides everything: FP16 weights take roughly two bytes per parameter, and the KV cache for context comes on top of that.
Because vLLM exposes an OpenAI-compatible API, moving existing code from a hosted provider is usually a two-line change of base URL and API key.
Quantization and context length are the two memory dials, and capping --max-model-len at 32K fits most coding work while avoiding out-of-memory errors.
Self-hosting wins with steady high-volume traffic or data-residency rules, but a hosted API is often cheaper for bursty or low-volume usage.

A year ago, self-hosting a coding model felt like a hobby project. The open weights you could download were a step behind the hosted APIs, and the gap was big enough that you'd notice it on real work. That has changed. The open-weight releases over the last few months have been genuinely good. Qwen3-Coder-Next shipped with 80B total parameters but only 3B active per token, which means it runs far cheaper than its quality would suggest. There are dense releases too, like the Qwen3.6 family, and a steady drip of fine-tunes and quantizations on Hugging Face within days of each base model landing.

So the question is no longer "is the open model good enough." For a lot of coding tasks, it is. The question is whether running it yourself makes sense for your situation, and if so, how to do it without fighting the tooling for a week.

This guide answers the second part. We'll serve an open-weight coding model with vLLM, which is the inference server most teams reach for when they want throughput, and then call it through the OpenAI-compatible API it exposes. Because the API matches OpenAI's shape, most code you already have pointed at a hosted provider will work against your own box with a changed base URL. I'll be honest about the parts that are annoying, and there's a section near the end on when you should just pay for a hosted API instead.

What you'll need

The hard requirement is a GPU with enough VRAM. This is the part people underestimate, so let's be concrete about it.

A model's weights at full precision (FP16 or BF16) take roughly two bytes per parameter. A 7B model is around 14 GB just for weights, before you account for the KV cache that holds the context. A 32B dense model at FP16 wants something like 64 GB, which means it does not fit on a single 24 GB consumer card without quantization. Mixture-of-experts models like Qwen3-Coder-Next are trickier to reason about: only 3B parameters are active per token, so they're fast, but all 80B parameters still have to live in memory. The active-parameter count helps your speed, not your VRAM bill.
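
The two-bytes-per-parameter rule is easy to turn into a quick estimate. The sketch below is only that arithmetic: weight_gb is a hypothetical helper name, it counts decimal gigabytes (10^9 bytes), and it ignores the KV cache, activations, and runtime overhead, so treat the result as a floor rather than a budget.

```python
def weight_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the model weights alone, in decimal GB."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters x bytes per parameter = gigabytes (1e9 bytes)
    return params_billion * bytes_per_param


for name, size in [("7B dense", 7), ("32B dense", 32), ("80B MoE, 3B active", 80)]:
    print(f"{name}: ~{weight_gb(size):.0f} GB at 16-bit")
# 7B dense: ~14 GB at 16-bit
# 32B dense: ~64 GB at 16-bit
# 80B MoE, 3B active: ~160 GB at 16-bit
```

Note that the MoE line uses the total parameter count, not the active one, which is exactly why the 3B-active figure does nothing for the memory bill.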

Here's a rough map for planning:

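A single 24 GB card (RTX 3090, RTX 4090) comfortably runs a 7B model at FP16, since the weights take about 14 GB. A 14B model needs roughly 28 GB at FP16, so on this card it has to be quantized to 8-bit or lower, and anything larger needs 4-bit and a modest context.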
A single 48 GB card (RTX 6000 Ada, L40S) runs a 32B dense model with quantization, or a 14B with a long context window.
An 80 GB card (A100 80 GB, H100) runs a 32B at FP16 with room for context, and is the realistic entry point for the bigger MoE models.
For Qwen3-Coder-Next at its full 256K context, you're looking at multiple 80 GB GPUs with tensor parallelism. The model card's own deployment example assumes a 2-to-4 GPU setup.

On top of the GPU you'll want a recent Linux box with NVIDIA drivers and CUDA, Python 3.10 or newer, and disk space for the weights (an 80B model is a large download). I'll use a model that fits a single 24 GB card for the runnable parts of this guide, then show the bigger-model command separately so you can scale up when your hardware allows.

Installing vLLM

vLLM moves fast, and the cleanest install path uses uv to manage the virtual environment and pull the right Torch backend automatically:

Shell

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

If you'd rather use plain pip, this works too:

Shell

python -m venv .venv
source .venv/bin/activate
pip install vllm

Either way, vLLM pulls in a specific build of PyTorch and CUDA libraries, so install it into a fresh environment rather than an existing project. Mixing it with other heavy ML dependencies is the fastest way to spend an afternoon on version conflicts. One thing worth checking: newer models often need a recent vLLM. Qwen3-Coder-Next, for instance, wants vLLM 0.15.0 or later. If a model fails to load with a parser or architecture error, an outdated vLLM is the first thing to suspect.

Serving the model

The basic command is short. Point vllm serve at a Hugging Face model ID and it downloads the weights and starts an HTTP server:

Shell

vllm serve Qwen/Qwen2.5-Coder-7B-Instruct

That's a real, current coding model that fits a 24 GB card at FP16, which makes it a good one to start with while you get the plumbing working. The server listens on http://localhost:8000 by default and exposes an OpenAI-compatible API. The first run is slow because it's downloading several gigabytes of weights; later runs use the local cache.

When the server prints a line about the application startup being complete, it's ready. You can sanity-check it without any client code:

Shell

curl http://localhost:8000/v1/models

That returns a JSON list with the model you loaded. If you get a connection refused, the server is still warming up or it crashed during load, and the terminal running vllm serve will tell you which.
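
If you start the server from a script, the same check can be polled until the model has finished loading. This is a minimal sketch using only the standard library. The function names, the timeout, and the polling interval are assumptions chosen for illustration, and it assumes the response follows the OpenAI list shape with model entries under a "data" key.

```python
import json
import time
import urllib.error
import urllib.request


def list_models(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Return the model IDs reported by the server's /models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(fetch=list_models, timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> list[str]:
    """Poll until the server answers, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return fetch()
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Connection refused usually means the model is still loading.
            if time.monotonic() >= deadline:
                raise TimeoutError("server did not become ready in time")
            time.sleep(interval_s)
```

Calling wait_until_ready() with no arguments blocks until localhost:8000 responds and returns the served model IDs. If the server crashed during load it will simply time out, so the vllm serve terminal is still where the real diagnosis happens.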

For a larger model, you add a few flags. Here is the deployment command from the Qwen3-Coder-Next model card, which spreads the model across two GPUs and turns on the tool-calling parser that coding agents rely on:

Shell

vllm serve Qwen/Qwen3-Coder-Next \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

--tensor-parallel-size 2 shards each layer's weight matrices across two GPUs, so each card holds roughly half of the weights, which is how you fit something too big for one card. The two tool-related flags matter if you plan to use the model as an agent that calls functions: without them, vLLM won't parse the model's tool-call output into the structured format the OpenAI API expects.
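
To make "structured format" concrete, here is a sketch of the client side of a tool call. It assumes the standard OpenAI chat-completions shape, where an assistant message carries a tool_calls list and each call's arguments arrive as a JSON string. The read_file tool and the dispatch_tool_calls helper are invented for illustration and are not part of vLLM or the model card.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Return the contents of a file in the repository.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
]


def dispatch_tool_calls(message: dict, handlers: dict) -> list[dict]:
    """Run each tool call in an assistant message and build the tool replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call["function"]["arguments"])
        result = handlers[name](**args)
        replies.append(
            {"role": "tool", "tool_call_id": call["id"], "content": str(result)}
        )
    return replies
```

In use, you would pass tools=TOOLS to client.chat.completions.create, convert the returned message to a dict, run it through dispatch_tool_calls, and append the replies to the conversation before asking the model to continue.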

Calling the OpenAI-compatible endpoint

This is the part that makes vLLM pleasant to work with. Because it speaks the OpenAI protocol, you use the official OpenAI client and just repoint it at localhost. The API key can be any non-empty string, since vLLM doesn't check it by default.

In Python:

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user", "content": "Write a Python function that returns the nth Fibonacci number iteratively."},
    ],
)

print(response.choices[0].message.content)

The model value has to match the ID you served. The rest is identical to what you'd write against a hosted provider. Streaming works the same way too, by passing stream=True and iterating over the chunks.
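
A streaming call might look like the sketch below. The stream_reply helper is a hypothetical name. It takes an already-constructed client so the same function works against localhost or a hosted provider, and it assumes the usual chunk shape, where the text arrives in choices[0].delta.content.

```python
def stream_reply(client, model: str, prompt: str) -> str:
    """Print tokens as they arrive and return the full reply."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices, for example a usage-only chunk
        delta = chunk.choices[0].delta.content
        if delta:  # the first and last chunks often have no text
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)
```

With the client from the example above, stream_reply(client, "Qwen/Qwen2.5-Coder-7B-Instruct", "Explain Python generators briefly.") prints the answer incrementally and returns it as one string.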

The JavaScript client is the mirror image:

JavaScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-Coder-7B-Instruct",
  messages: [
    { role: "system", content: "You are a precise coding assistant." },
    { role: "user", content: "Write a TypeScript function to debounce an async call." },
  ],
});

console.log(response.choices[0].message.content);

If you have an existing app built on the OpenAI SDK, this is usually a two-line change: swap the base URL and drop the real API key. That portability is the practical reason to prefer vLLM's OpenAI mode over a bespoke inference protocol. You can develop against a hosted API, then move to your own server later without rewriting the call sites.
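
One way to keep that switch out of the call sites entirely is to read the two values from the environment. This is a sketch of one possible arrangement, not something vLLM requires. The LLM_BASE_URL and LLM_API_KEY variable names and the client_kwargs helper are hypothetical.

```python
import os

LOCAL_BASE_URL = "http://localhost:8000/v1"


def client_kwargs(env=os.environ) -> dict:
    """Connection settings for OpenAI(...): the local vLLM server unless overridden."""
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_BASE_URL),
        # vLLM ignores the key by default, but the SDK wants a non-empty string.
        "api_key": env.get("LLM_API_KEY", "not-needed"),
    }
```

The application then builds its client with OpenAI(**client_kwargs()), and moving between a hosted provider and your own box becomes a deployment setting rather than a code change.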

Quantization and context, the two VRAM dials

Once the basic setup works, the two settings you'll actually tune are quantization and context length, because both trade quality or capability for memory.

Quantization shrinks the weights by storing them in fewer bits. FP8 roughly halves the memory of a BF16 model with a small accuracy cost, and on modern hardware it's often close to free in quality terms. Going down to 4-bit (via formats like AWQ or GPTQ) saves more but the quality drop becomes noticeable on harder reasoning. Many popular models have prequantized versions on Hugging Face, often published by the original team or by groups like Unsloth, and you serve them by pointing at the quantized repo. For example, an FP8 variant is served the same way as the base model, just with the FP8 model ID. If you grab a community AWQ build, you sometimes need to tell vLLM the scheme:

Shell

vllm serve some-org/Model-AWQ --quantization awq

Context length is the other dial, and it's the one that surprises people. The KV cache grows with the number of tokens you hold in context, and at long context it can eat more memory than the weights. A model advertised with a 256K window will happily refuse to load if you ask for all 256K on hardware that can't hold the cache. The fix is to cap it:

Shell

vllm serve Qwen/Qwen3-Coder-Next --max-model-len 32768

The Qwen3-Coder-Next card recommends exactly this move, dropping the context to 32,768 if you hit out-of-memory errors. Most coding work fits comfortably in 32K, so this is usually a good trade rather than a painful one. Set the context to what your tasks actually need, not to the model's maximum.
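
To see why the cap matters, it helps to estimate the cache size. The sketch below uses the standard formula for a transformer with ordinary attention: one key vector and one value vector per layer, per KV head, per token. The layer count, head count, and head dimension are illustrative values for a 7B-class model with grouped-query attention, not figures from any model card. Models that mix in other attention variants will not follow this formula exactly.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for a single sequence of `tokens` tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token


# Illustrative architecture numbers (assumed, not from a model card).
SHAPE = dict(layers=28, kv_heads=4, head_dim=128)

for context in (32_768, 262_144):
    gib = kv_cache_bytes(context, **SHAPE) / 2**30
    print(f"{context:>7} tokens: {gib:.2f} GiB per sequence")
#   32768 tokens: 1.75 GiB per sequence
#  262144 tokens: 14.00 GiB per sequence
```

The cache grows linearly with context, and that figure is per sequence. A server handling several long requests at once needs a multiple of it, which is how the cache ends up rivaling the weights.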

Gotchas

A few things will bite you, so here they are up front.

VRAM is the constraint that decides everything, and the failure mode is loud: vLLM logs an out-of-memory error during startup and exits. When that happens, your levers are quantizing the weights, lowering --max-model-len, lowering --gpu-memory-utilization if something else needs room on the card, or adding GPUs with tensor parallelism. There's no clever flag that conjures memory you don't have.
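
If you launch the server from a script, it can help to keep those levers in one place. The helper below is a hypothetical convenience wrapper that only assembles the command line from the flags mentioned in this guide. It does not validate that the combination fits your hardware.

```python
import shlex


def build_serve_command(model, max_model_len=None, gpu_memory_utilization=None,
                        tensor_parallel_size=None, quantization=None) -> list[str]:
    """Assemble a `vllm serve` argument list from the memory-related settings."""
    cmd = ["vllm", "serve", model]
    options = {
        "--max-model-len": max_model_len,
        "--gpu-memory-utilization": gpu_memory_utilization,
        "--tensor-parallel-size": tensor_parallel_size,
        "--quantization": quantization,
    }
    for flag, value in options.items():
        if value is not None:  # leave unset options to vLLM's defaults
            cmd += [flag, str(value)]
    return cmd


cmd = build_serve_command("Qwen/Qwen3-Coder-Next", max_model_len=32768,
                          tensor_parallel_size=2)
print(shlex.join(cmd))
# vllm serve Qwen/Qwen3-Coder-Next --max-model-len 32768 --tensor-parallel-size 2
```

The resulting list can be handed to subprocess.run(cmd) as-is, which avoids shell quoting problems when model IDs or paths contain unusual characters.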

Throughput and latency pull in opposite directions, and vLLM is built for throughput. It uses continuous batching, which means it serves many requests at once very efficiently, but a single lonely request to an idle server can feel slower than a hosted API that's optimized for that case. If you're serving a team or an agent fleet that fires lots of concurrent calls, vLLM shines. If you just want snappy single-user latency on modest hardware, the experience can be underwhelming, and a tool like Ollama is often a friendlier fit for that workload.
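
You can measure this on your own hardware instead of taking it on faith. The sketch below times the same batch of prompts at different concurrency levels. The time_batch helper is a hypothetical name, and send stands for whatever function issues one request, for example a thin wrapper around client.chat.completions.create.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_batch(send, prompts: list[str], concurrency: int) -> float:
    """Send every prompt through `send`, `concurrency` at a time; return wall-clock seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() forces every request to finish and surfaces any exception.
        list(pool.map(send, prompts))
    return time.perf_counter() - start
```

Comparing time_batch(send, prompts, 1) with time_batch(send, prompts, 16) on a realistic prompt set shows how much continuous batching buys you. If the two numbers are close, your workload is not concurrent enough to benefit.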

Quantization tradeoffs are real but easy to over-fear. FP8 is usually safe. Aggressive 4-bit quantization on a model that's already small can tip it from useful to frustrating, especially for multi-step reasoning or tool use. Test the quantized version on your own tasks before you commit, because benchmark scores don't always predict how it behaves on your codebase.
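
A tiny harness is enough to make that comparison repeatable. The sketch below is deliberately naive: pass_rate and the example cases are hypothetical, and the substring checks stand in for real checks such as running the generated code against your test suite.

```python
def pass_rate(ask, cases) -> float:
    """Fraction of (prompt, check) cases whose reply satisfies its check."""
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(1 for prompt, check in cases if check(ask(prompt)))
    return passed / len(cases)


# Placeholder cases; replace with prompts and checks drawn from your own codebase.
CASES = [
    ("Write a Python function named slugify that lowercases and hyphenates a title.",
     lambda reply: "def slugify" in reply),
    ("Write a Python function named chunked that splits a list into fixed-size batches.",
     lambda reply: "def chunked" in reply),
]
```

Here ask is any function that sends one prompt to one endpoint and returns the reply text. Running pass_rate once against the full-precision server and once against the quantized one gives you two numbers grounded in your own tasks rather than in a public benchmark.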

And the honest one: sometimes a hosted API is just cheaper. A GPU that can run a 32B model costs real money whether you rent it by the hour or buy it outright, and it costs that money whether or not you're sending requests. If your usage is bursty or low-volume, the math often favors a hosted provider, where you pay per token and nothing while idle. Self-hosting wins when you have steady, high-volume traffic, strict data-residency requirements that forbid sending code to a third party, or a need to run a specific open model that no provider hosts. If none of those apply, don't talk yourself into running a server for the principle of it.
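
The break-even point is simple arithmetic once you have two prices. The numbers below are placeholders, not quotes from any provider, and the calculation is a simplification: it uses one blended per-token price, ignores engineering time, and assumes the GPU can actually sustain the break-even throughput.

```python
def break_even_tokens_per_hour(gpu_cost_per_hour: float,
                               api_price_per_million_tokens: float) -> float:
    """Tokens per hour at which a rented GPU costs the same as paying per token."""
    return gpu_cost_per_hour / api_price_per_million_tokens * 1_000_000


# Placeholder prices for illustration only; substitute real quotes.
tokens = break_even_tokens_per_hour(gpu_cost_per_hour=2.00,
                                    api_price_per_million_tokens=1.00)
print(f"Break-even: {tokens:,.0f} tokens per hour")
# Break-even: 2,000,000 tokens per hour
```

Below that sustained rate, every hour the GPU sits partly idle is money the hosted API would not have charged you. Above it, the server starts paying for itself.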

Wrapping up

The open-weight coding models available right now are good enough that self-hosting is a real option rather than a compromise, and vLLM makes the serving side about as painless as it gets. The whole flow is short: install vLLM, run vllm serve against a Hugging Face model ID, and call the OpenAI-compatible endpoint with the client you already know. Quantization and context length are the two dials you'll spend time on, and VRAM is the wall you'll keep running into.

Start small. Get the 7B model answering on a single card, confirm your existing OpenAI client code works against localhost, and only then scale up to the bigger MoE models on multi-GPU hardware. If, after measuring your actual usage, a hosted API turns out cheaper and simpler, that's a perfectly good answer too. The point of this exercise is options, and now you have them.