Tutorial // DevTools · 2026-06-15 · 11 min read

Build a Code Completion Backend with Codestral

Codestral is Mistral's model built for code. Here's how to stand up a fill-in-the-middle code completion backend you can wire into an editor.

Varun Raj Manoharan, Founder & Principal Engineer
Codestral, Mistral, Code Completion, Tutorial

Key takeaways

  • Codestral's fill-in-the-middle endpoint takes a prefix and suffix and generates the code between them, fitting the surrounding code better than a chat prompt.
  • Latency is the priority for inline completion, and the biggest levers are keeping max_tokens small (64 to 128) and using stop sequences like a blank line.
  • Debounce requests around 250ms in the editor and use an AbortController to cancel in-flight calls so stale completions never land after the cursor moves.
  • Cache completions keyed on the tail of the prefix and head of the suffix, the parts nearest the cursor that drive the result, to avoid paying for repeats.

Mistral has been busy. The last few months brought Mistral Large 3 and Medium 3.5 into the lineup, both general-purpose models that happen to be good at code. But for the narrow job of completing code inside an editor, the model I keep reaching for is still Codestral. It is purpose-built for code, it speaks fill-in-the-middle natively, and it has a dedicated endpoint that does exactly one thing well.

That endpoint matters more than the model name. Most LLM code suggestions come from a chat call where you stuff a prompt full of "here is the file, here is the cursor, please continue" and hope the model behaves. Fill-in-the-middle (FIM) is different. You hand the model the code before the cursor as a prefix and the code after the cursor as a suffix, and it generates the piece that goes in between. The model has real context on both sides, so the completion actually fits the surrounding code instead of running off into a tangent.

This tutorial builds a small HTTP backend around that FIM endpoint. Editor talks to your backend, your backend talks to Codestral, completions come back. Nothing fancy, but enough structure that you can wire it into a VS Code extension or a CodeMirror widget without rethinking the whole thing later.

What you'll need

  • Node.js 20 or newer. The code uses the built-in fetch, so nothing extra to install for the HTTP calls.
  • A Mistral API key. Create one in the Mistral console and put it in an environment variable. Codestral has its own usage terms, so check the pricing page before you point production traffic at it.
  • The current Codestral model id, which is codestral-2508 (the v25.08 release). You can also use the codestral-latest alias if you want Mistral to roll you forward automatically, though pinning a dated id is safer for a backend you do not want shifting under you.
  • An editor or test client to send requests. Curl is fine for the first few steps.

I am going to keep the dependency list short on purpose. You can swap in Express or Fastify later, but the standard library covers everything we need to demonstrate the shape of the thing.

Step 1: Call the FIM endpoint directly

Before wrapping anything, confirm the raw call works. The FIM endpoint lives at https://api.mistral.ai/v1/fim/completions. It takes a prompt (your prefix), an optional suffix, a model, and the usual sampling controls. Here is a curl call that asks Codestral to fill in the body of a function.

Shell
curl https://api.mistral.ai/v1/fim/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -d '{
    "model": "codestral-2508",
    "prompt": "def fib(n):\n    ",
    "suffix": "\n\nprint(fib(10))",
    "max_tokens": 128,
    "temperature": 0.2
  }'

The prompt is everything to the left of the cursor and suffix is everything to the right. The model returns the middle. Note the low temperature. For inline completion you usually want the model to be fairly deterministic, since wild creativity in the middle of someone's function is rarely helpful.

The response looks like a standard completion object: an id, a model, a usage block with token counts, and a choices array. The text you care about is in the first choice. A trimmed version reads like this.

JSON
{
  "id": "fim-cmpl-...",
  "object": "chat.completion",
  "model": "codestral-2508",
  "usage": { "prompt_tokens": 18, "completion_tokens": 31, "total_tokens": 49 },
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "if n < 2:\n        return n\n    return fib(n - 1) + fib(n - 2)" },
      "finish_reason": "stop"
    }
  ]
}

If that works, the rest is plumbing.

Step 2: A thin client function

Pull the API call into a single function so the HTTP server does not have to know anything about Mistral. This keeps the model details in one place, which is where you will want them when you tweak temperature or swap the model id later.

JavaScript
// codestral.js
const FIM_URL = "https://api.mistral.ai/v1/fim/completions";

export async function fimComplete({ prefix, suffix, maxTokens = 96, stop }) {
  const res = await fetch(FIM_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.MISTRAL_API_KEY}`,
    },
    body: JSON.stringify({
      model: "codestral-2508",
      prompt: prefix,
      suffix: suffix ?? "",
      max_tokens: maxTokens,
      temperature: 0.2,
      stop,
    }),
  });

  if (!res.ok) {
    const detail = await res.text();
    throw new Error(`Codestral FIM failed: ${res.status} ${detail}`);
  }

  const data = await res.json();
  return data.choices?.[0]?.message?.content ?? "";
}

A couple of things to call out. The function names its inputs prefix and suffix because that is how an editor thinks about it, then maps prefix onto the API's prompt field. I also default max_tokens low. Inline completions are short by nature, and capping the length is the single biggest lever you have on latency. We will come back to that.

Step 3: Wrap it in an HTTP backend

Now the server. It accepts a POST with the prefix and suffix from the editor, calls the client, and returns the completion as JSON. I am using the built-in node:http module to keep the surface small.

JavaScript
// server.js
import { createServer } from "node:http";
import { fimComplete } from "./codestral.js";

const server = createServer(async (req, res) => {
  if (req.method !== "POST" || req.url !== "/complete") {
    res.writeHead(404).end();
    return;
  }

  let body = "";
  for await (const chunk of req) body += chunk;

  let payload;
  try {
    payload = JSON.parse(body);
  } catch {
    res.writeHead(400, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ error: "invalid JSON" }));
    return;
  }

  const { prefix = "", suffix = "" } = payload;

  try {
    const completion = await fimComplete({
      prefix,
      suffix,
      stop: ["\n\n"],
    });
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ completion }));
  } catch (err) {
    res.writeHead(502, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ error: String(err.message) }));
  }
});

server.listen(8787, () => {
  console.log("code completion backend on http://localhost:8787");
});

Test it the same way the editor will.

Shell
curl http://localhost:8787/complete \
  -H "Content-Type: application/json" \
  -d '{"prefix": "function greet(name) {\n  return ", "suffix": "\n}"}'

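You should get back something like {"completion": "`Hello, ${name}!`;"}. The exact text will vary, but it should be a single expression that fits between the return and the closing brace. That is the whole round trip. Editor sends two strings, backend returns one.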

Notice the stop: ["\n\n"] I passed in the server. For an inline suggestion you often want to stop at the first blank line so the model does not try to write three more functions you did not ask for. Stop sequences are the cleanest way to keep completions tight, and they save tokens too.

Step 4: Debounce on the way in

An editor fires a completion request on nearly every keystroke if you let it. That is a great way to burn through your token budget and rate limits in an afternoon. The fix is debouncing, and the right place to do it is the client, since only the client knows when the user has paused. You can also guard the backend so a misbehaving client cannot flood you.
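For the backend guard, a fixed-window counter per client is often enough. The following is a minimal sketch under my own assumptions: createRateGuard, its limit and windowMs defaults, and the idea of using the remote address as the client id are all hypothetical and not part of the server built above.

```javascript
// Hypothetical fixed-window limiter: at most `limit` requests per client per window.
// `now` is injectable so the behavior can be checked without waiting on a real clock.
function createRateGuard({ limit = 20, windowMs = 10_000, now = Date.now } = {}) {
  const windows = new Map(); // clientId -> { start, count }

  return function allow(clientId) {
    const t = now();
    const w = windows.get(clientId);
    if (!w || t - w.start >= windowMs) {
      // First request from this client, or the previous window has expired.
      windows.set(clientId, { start: t, count: 1 });
      return true;
    }
    w.count += 1;
    return w.count <= limit;
  };
}

// Possible wiring in the request handler, before any call to Codestral
// (using the socket address as the client id is an assumption):
// const allow = createRateGuard();
// if (!allow(req.socket.remoteAddress)) { res.writeHead(429).end(); return; }
```

This sketch never prunes the map, so a long-running server with many distinct clients would want to drop old entries periodically.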

Here is a small debounce helper you would run in the editor extension, not the server. It waits for a quiet gap before sending.

JavaScript
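function debounce(fn, ms) {
  let timer;
  let settlePending; // resolves the superseded call so its awaiter does not hang
  return (...args) =>
    new Promise((resolve, reject) => {
      clearTimeout(timer);
      settlePending?.(); // the earlier, now-cancelled call resolves to undefined
      settlePending = () => resolve(undefined);
      timer = setTimeout(() => {
        settlePending = undefined; // this call is actually running now
        Promise.resolve(fn(...args)).then(resolve, reject);
      }, ms);
    });
}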

const requestCompletion = debounce(async (prefix, suffix) => {
  const res = await fetch("http://localhost:8787/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prefix, suffix }),
  });
  return (await res.json()).completion;
}, 250);

A 250 millisecond gap is a reasonable starting point. Shorter feels snappier but sends more requests when someone is typing fast. Longer saves money but starts to feel laggy. There is no universally right number, so tune it against your own editor feel.

You also want to cancel in-flight requests when the user keeps typing. The cleanest way is an AbortController tied to the request, aborted whenever a newer keystroke arrives. Otherwise a slow response can land after the cursor has already moved, and inserting stale text is worse than inserting nothing.
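As a rough illustration of that cancellation pattern, here is a sketch of a wrapper that aborts the previous request whenever a new one starts. The names createLatestOnly and requestLatest are mine, and the wrapper assumes the wrapped function takes an AbortSignal as its last argument. A superseded request resolves to null so the caller can simply show nothing.

```javascript
// Sketch: only the most recent request is allowed to finish.
// `sendRequest` is a hypothetical function whose last argument is an AbortSignal.
function createLatestOnly(sendRequest) {
  let controller = null;

  return async function request(...args) {
    controller?.abort(); // cancel whatever is still in flight
    const mine = new AbortController();
    controller = mine;
    try {
      return await sendRequest(...args, mine.signal);
    } catch (err) {
      if (mine.signal.aborted) return null; // superseded: nothing to insert
      throw err; // a real failure, let the caller see it
    }
  };
}

// Example wiring against the backend from Step 3. fetch rejects when the
// signal is aborted, which the wrapper turns into a null result.
const requestLatest = createLatestOnly(async (prefix, suffix, signal) => {
  const res = await fetch("http://localhost:8787/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prefix, suffix }),
    signal,
  });
  return (await res.json()).completion;
});
```

Aborting on the client only stops the editor from waiting. Whether the backend also stops its upstream call depends on it noticing the closed connection, which the server in this tutorial does not do.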

Step 5: Cache identical contexts

People backspace and retype the same thing constantly. They also sit on the same cursor position while reading. A small cache keyed on the prefix and suffix avoids paying for completions you have already seen. Add it to the server so every client benefits.

JavaScript
const cache = new Map();
const MAX_CACHE = 500;

function cacheKey(prefix, suffix) {
  // last chunk of context is what actually drives the completion
  // "\u0000" is a separator that should never appear in source text
  return prefix.slice(-400) + "\u0000" + suffix.slice(0, 200);
}

async function cachedComplete(prefix, suffix) {
  const key = cacheKey(prefix, suffix);
  if (cache.has(key)) return cache.get(key);

  const completion = await fimComplete({ prefix, suffix, stop: ["\n\n"] });

  if (cache.size >= MAX_CACHE) {
    cache.delete(cache.keys().next().value); // evict oldest
  }
  cache.set(key, completion);
  return completion;
}

Keying on the tail of the prefix and the head of the suffix is deliberate. Those are the parts nearest the cursor, and they are what the model leans on most. Keying on the entire file would make the cache nearly useless, since a one-character change anywhere in the file would cause a miss. The tradeoff is that two contexts that differ only far from the cursor share an entry, which is usually acceptable for short inline suggestions. This is a deliberately simple in-memory cache, fine for a single process. If you run multiple instances, move it to something shared like Redis and put a short time-to-live on entries so stale suggestions age out.

Swap cachedComplete in for the direct fimComplete call in your request handler and you are done with the core backend.
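If you want entries to age out in the in-memory version as well, the Map can carry an expiry alongside each value. This is a sketch with hypothetical names and defaults (createTtlCache, maxEntries, ttlMs), not a drop-in replacement for a shared store.

```javascript
// Sketch of a bounded cache with per-entry expiry.
// `now` is injectable so expiry can be exercised without real waiting.
function createTtlCache({ maxEntries = 500, ttlMs = 60_000, now = Date.now } = {}) {
  const entries = new Map(); // key -> { value, expiresAt }

  return {
    get(key) {
      const hit = entries.get(key);
      if (!hit) return undefined;
      if (hit.expiresAt <= now()) {
        entries.delete(key); // expired: treat as a miss
        return undefined;
      }
      return hit.value;
    },
    set(key, value) {
      entries.delete(key); // re-inserting moves the key to the newest position
      if (entries.size >= maxEntries) {
        entries.delete(entries.keys().next().value); // evict the oldest insertion
      }
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}
```

In cachedComplete you would then check the result of get against undefined instead of calling cache.has, since an expired entry reports as missing.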

Gotchas

Latency is the whole game for inline completion. Nobody waits for an inline suggestion. If it does not show up in a few hundred milliseconds, the user has already typed past it. The two biggest levers are max_tokens (keep it small, 64 to 128 is plenty for inline) and stop sequences (cut the generation off the moment it has produced something usable). Debouncing helps too, but it shifts latency rather than removing it. Be honest with yourself about the ceiling here. A network hop to a hosted model plus generation time means inline completion will never feel as instant as a local model running on the same machine, and that is a real tradeoff to weigh against quality.
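One way to act on that budget is to give every completion call a hard deadline and return an empty suggestion when it is missed. The sketch below assumes a hypothetical task function that accepts an AbortSignal. The fimComplete function from Step 2 does not take one as written, so you would need to thread a signal through to its fetch call first.

```javascript
// Sketch: run `task` with a deadline, falling back to an empty completion.
// `task` is a hypothetical function that takes an AbortSignal and returns a promise.
async function withDeadline(task, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await task(controller.signal);
  } catch (err) {
    if (controller.signal.aborted) return ""; // too slow: show nothing
    throw err; // unrelated failure, surface it
  } finally {
    clearTimeout(timer); // do not leave the timer running after a fast result
  }
}

// Possible use, assuming a signal-aware variant of the client function:
// const completion = await withDeadline(
//   (signal) => fimCompleteWithSignal({ prefix, suffix, signal }),
//   400,
// );
```

The right deadline depends on your editor and your users, so treat 400 as a placeholder rather than a recommendation.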

Context windowing. Codestral has a large context window, but sending the entire file on every keystroke is wasteful and slow. Send a window around the cursor instead. A few hundred lines of prefix and a shorter suffix usually capture enough local structure. If you have a symbol index or relevant imports, prepend just those rather than the whole project. More context is not free, and past a point it does not improve the completion either.
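A line-based window is the simplest version of this. The sketch below splits the buffer at the cursor and keeps a fixed number of lines on each side. The function name and the default line counts are illustrative assumptions, not tuned values.

```javascript
// Sketch: trim the editor buffer to a window around the cursor before sending it.
// `cursorOffset` is a character offset into `text`.
function windowAroundCursor(text, cursorOffset, { prefixLines = 200, suffixLines = 50 } = {}) {
  const before = text.slice(0, cursorOffset).split("\n");
  const after = text.slice(cursorOffset).split("\n");
  return {
    prefix: before.slice(-prefixLines).join("\n"), // last N lines before the cursor
    suffix: after.slice(0, suffixLines).join("\n"), // first M lines after the cursor
  };
}
```

The editor would call this on each request and send the resulting prefix and suffix to the backend in place of the full buffer.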

Stop tokens need attention. Without a stop sequence the model happily generates well past the point you wanted. For inline use, stopping at a blank line (\n\n) or a closing brace on its own line keeps suggestions to a single logical unit. Watch the finish_reason in the response. If it reads length instead of stop, you hit the token cap, which usually means your max_tokens is too low for that context or your stop sequences are not matching.
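To make that check concrete, here is a small sketch that reads the finish_reason next to the text, using the response fields shown in Step 1. The inspectCompletion helper is a hypothetical name, and what you do with a truncated result (log it, discard it, raise the cap) is up to you.

```javascript
// Sketch: pull the text and a truncation flag out of a parsed FIM response.
function inspectCompletion(data) {
  const choice = data.choices?.[0];
  const text = choice?.message?.content ?? "";
  // "length" means generation hit max_tokens before any stop sequence matched.
  const truncated = choice?.finish_reason === "length";
  return { text, truncated };
}

// Possible use inside the client function during development:
// const { text, truncated } = inspectCompletion(await res.json());
// if (truncated) console.warn("completion hit max_tokens; check stop sequences");
```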

Cost adds up quietly. Each keystroke that gets past the debounce is a billable call. Caching repeated contexts and a sensible debounce window together cut request volume more than any single optimization. Watch the usage field in responses during development so you have a real sense of token consumption before you ship. Codestral is priced for code-specific use, so confirm the current rates rather than assuming they match a general chat model.

Wrapping up

The backend here is small because the FIM endpoint does the hard part. You give Codestral a prefix and a suffix, it returns the middle, and everything else is the engineering around making that fast and cheap enough to run on every keystroke. Debounce to control volume, cache to avoid repeat work, cap tokens and set stop sequences to keep both latency and cost in check.

From here the next steps are mostly editor-side: streaming partial completions so they appear as they generate, ranking multiple candidates, and deciding when a suggestion is good enough to show versus stay quiet. The server you have now is the foundation those features build on, and none of them require changing the shape of the FIM call itself.