Tutorial // RAG · 2026-06-23 · 14 min read

Build a RAG Chatbot Over Your Docs with Claude and pgvector

A complete, working tutorial: ingest documents, embed them into Postgres with pgvector, and answer questions with Claude, citations included.

Varun Raj Manoharan, Founder & Principal Engineer

RAG · Claude · pgvector · Next.js · Tutorial

Key takeaways

RAG comes down to chunk, embed, store, retrieve by vector distance, and answer with the retrieved text as context, all on Postgres with pgvector.
The pgvector column dimension must match the embedding model's output, and documents and queries must be embedded with the same model or distances become noise.
Number each retrieved chunk in the prompt and tell Claude to cite those numbers, then map them back to real documents for reliable citations.
Instruct Claude to say it does not know when the context lacks the answer, since a confident wrong answer is worse than admitting a retrieval miss.

By the end of this you'll have a chatbot that answers questions from your own documents and tells you which document each answer came from. No vector database service, no framework that hides what's actually happening. Just Postgres with the pgvector extension, an embedding model, and Claude.

I've built this pattern enough times now that I have opinions about the parts that bite you later, so I've put those in a gotchas section near the end. Read it before you ship.

This is for people who write TypeScript and have shipped a Next.js app before. You don't need to know anything about embeddings or retrieval going in. I'll explain the ideas as we hit them, but I won't slow down to explain what an API route is.

Here's the whole shape of it: you chop your documents into chunks, turn each chunk into a vector with an embedding model, and store those vectors in Postgres. When a question comes in, you embed the question the same way, find the chunks whose vectors sit closest to it, and hand those chunks to Claude as context. Claude writes the answer and cites the chunks it used.

What you'll need

Node 20 or later and a Next.js app (App Router). If you're starting fresh, npx create-next-app@latest.
Postgres 14+ with the pgvector extension. I'll show you a Docker one-liner.
An Anthropic API key for Claude.
An OpenAI API key for embeddings. You can swap the embedding provider later; I'll note where the dimension matters.

A note on why two providers. Claude is what answers the question. The embedding model is a separate, cheaper model whose only job is to turn text into vectors so we can measure "closeness" between a question and a chunk. Anthropic doesn't ship a dedicated embeddings endpoint, so I reach for OpenAI's text-embedding-3-small here. Any embedding model works as long as you embed your documents and your queries with the same one, and your Postgres column matches its output dimension.

Install the packages:

Shell

npm install @anthropic-ai/sdk openai pg
npm install -D @types/pg

And set your environment variables in .env.local:

Shell

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
DATABASE_URL=postgres://postgres:postgres@localhost:5432/ragdemo

Step 1: Postgres with pgvector

The fastest way to get a Postgres that already has pgvector compiled in is the official image:

Shell

docker run -d \
  --name rag-postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=ragdemo \
  -p 5432:5432 \
  pgvector/pgvector:pg17

That image is plain Postgres 17 with the extension available, but the extension isn't loaded until you ask for it. Connect and turn it on:

SQL

CREATE EXTENSION IF NOT EXISTS vector;

pgvector adds a vector column type and a set of distance operators so Postgres can answer "which rows are closest to this vector". That's the entire reason we're using it. The alternative is bolting on a separate vector database, and for most document-Q&A workloads that's a service you don't need to run when the database you already have can do it.

Step 2: The documents and embeddings tables

Two tables. One holds the source documents so you have something to cite back to. The other holds the chunks and their vectors.

SQL

CREATE TABLE documents (
  id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  title       TEXT NOT NULL,
  source_url  TEXT,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE chunks (
  id           BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  document_id  BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index  INT NOT NULL,
  content      TEXT NOT NULL,
  embedding    vector(1536) NOT NULL
);

The vector(1536) is load-bearing. 1536 is the output dimension of text-embedding-3-small. If you switch to a model that emits 768 or 3072 dimensions, this number has to change with it, or every insert will fail. The column dimension and the model dimension are one fact written in two places, and Postgres will not let you mix them.

ON DELETE CASCADE means deleting a document cleans up its chunks. You'll appreciate that the first time you re-ingest a document that changed.

Now the index. Without one, a similarity search reads every row and computes distance against all of them. Fine for a hundred chunks, painful at a hundred thousand.

SQL

CREATE INDEX ON chunks
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

ivfflat partitions your vectors into buckets (the lists parameter sets how many) and, at query time, only searches the buckets nearest your query vector. That makes it an approximate index: it trades a little recall for a lot of speed. lists = 100 is a reasonable starting point; the pgvector docs suggest roughly rows / 1000 for up to a million rows and sqrt(rows) beyond that. One catch that trips people up: ivfflat builds its buckets from the data that exists when you create the index. Build the index after you've loaded a representative amount of data, not on an empty table, or the buckets will be meaningless.
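If you'd rather compute that starting point than eyeball it, here is a minimal sketch of the rule of thumb. The function name and the floor of 1 are my own assumptions, not anything pgvector provides; treat the result as a first guess to tune, not a prescription.

```typescript
// Hypothetical helper (not part of pgvector): a starting value for
// WITH (lists = ...), following the docs' rule of thumb of rows / 1000
// up to a million rows and sqrt(rows) above that.
export function suggestIvfflatLists(rowCount: number): number {
  if (rowCount <= 1_000_000) {
    // The floor of 1 is an assumption so tiny tables still get a valid value.
    return Math.max(1, Math.round(rowCount / 1000));
  }
  return Math.round(Math.sqrt(rowCount));
}
```

For 100,000 chunks this lands on the lists = 100 used above.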

Step 3: Chunking

You can't embed a whole 40-page document as one vector and expect good retrieval. The vector would be an average of everything the document says, which is to say it's about nothing in particular. So you split the document into chunks small enough to be about one thing, and embed each chunk.

Here's a chunker that splits on paragraph boundaries and packs paragraphs up to a character budget, with a little overlap between consecutive chunks:

TypeScript

// lib/chunk.ts
export interface Chunk {
  index: number;
  content: string;
}

export function chunkText(
  text: string,
  { maxChars = 1200, overlapChars = 200 }: { maxChars?: number; overlapChars?: number } = {},
): Chunk[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks: Chunk[] = [];
  let buffer = "";

  const flush = () => {
    if (!buffer) return;
    chunks.push({ index: chunks.length, content: buffer });
    // Carry the tail of this chunk into the next one for context overlap.
    buffer = overlapChars > 0 ? buffer.slice(-overlapChars) : "";
  };

  for (const para of paragraphs) {
    if (buffer && buffer.length + para.length + 2 > maxChars) {
      flush();
    }
    buffer = buffer ? `${buffer}\n\n${para}` : para;
  }
  flush();

  return chunks;
}

The overlap matters more than it looks. A sentence that explains a term can land at the end of one chunk while the sentence that uses the term lands at the start of the next. Without overlap, retrieval can pull the chunk that uses the term and miss the one that defines it. Carrying the last couple hundred characters forward hedges against that. It's not free (overlapping text gets embedded and stored twice), but it's cheap insurance.

I'm splitting on blank lines here because that's where most prose and Markdown put real boundaries. If your documents are code, transcripts, or tables, you'll want a splitter that respects those structures instead. Chunking is the part of RAG where naive defaults quietly cost you the most accuracy, so it's worth tuning to your actual content.
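One case the chunker above doesn't handle: a single paragraph longer than maxChars passes through as one oversized chunk. Here is a rough sketch of a pre-pass that breaks such a paragraph on sentence boundaries and falls back to a hard cut. The helper name is hypothetical, and the sentence regex is deliberately naive (it will split after abbreviations like "e.g."), so treat it as a starting point.

```typescript
// Hypothetical helper: break one long paragraph into pieces of at most
// maxChars, preferring sentence boundaries and hard-cutting as a last resort.
export function splitLongParagraph(para: string, maxChars = 1200): string[] {
  if (para.length <= maxChars) return [para];

  // Split at whitespace that follows sentence-ending punctuation.
  const sentences = para.split(/(?<=[.!?])\s+/);
  const pieces: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    // Start a new piece if adding this sentence would exceed the budget.
    if (current && current.length + 1 + sentence.length > maxChars) {
      pieces.push(current);
      current = "";
    }
    current = current ? `${current} ${sentence}` : sentence;

    // A single sentence longer than the budget gets cut at the limit.
    while (current.length > maxChars) {
      pieces.push(current.slice(0, maxChars));
      current = current.slice(maxChars);
    }
  }
  if (current) pieces.push(current);
  return pieces;
}
```

You would run each paragraph through this before the packing loop, for example with paragraphs.flatMap((p) => splitLongParagraph(p, maxChars)).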

Step 4: Generating embeddings

One small wrapper around the embedding API. It takes a batch of strings and returns a batch of vectors, in order.

TypeScript

// lib/embed.ts
import OpenAI from "openai";

const openai = new OpenAI();

export const EMBEDDING_MODEL = "text-embedding-3-small";
export const EMBEDDING_DIM = 1536; // must match vector(1536) in the schema

export async function embed(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: texts,
  });
  // The API preserves input order, but sort by index to be safe.
  return res.data
    .sort((a, b) => a.index - b.index)
    .map((d) => d.embedding);
}

Batching is the thing to get right here. Embedding endpoints accept many strings per call, and one call for fifty chunks is far faster and cheaper than fifty calls. The EMBEDDING_DIM constant lives next to the model name on purpose: when someone changes the model, the dimension is right there to change with it, and it's the same number that has to match your SQL column.
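Batching has an upper bound too: providers cap how many inputs and tokens a single request may carry, so a very large document shouldn't go out as one call. Here is a sketch of slicing the work into fixed-size batches. Both function names are hypothetical, embedFn stands in for the embed() wrapper above, and the default batch size of 100 is a placeholder to check against your provider's documented limits.

```typescript
// Split an array into consecutive slices of at most `size` items.
export function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed texts batch by batch, preserving input order.
// `embedFn` is assumed to return one vector per input, in order.
export async function embedInBatches(
  texts: string[],
  embedFn: (batch: string[]) => Promise<number[][]>,
  batchSize = 100, // placeholder value, not a provider limit
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    vectors.push(...(await embedFn(batch)));
  }
  return vectors;
}
```

In ingestDocument you could then call embedInBatches(chunks.map((c) => c.content), embed) instead of embed(...) directly.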

Step 5: Ingesting a document

Now wire chunking and embedding into a function that takes a document and writes everything to Postgres. pgvector wants the vector as a string literal like [0.1,0.2,...], so there's a small formatting step.

TypeScript

// lib/ingest.ts
import { Pool } from "pg";
import { chunkText } from "./chunk";
import { embed } from "./embed";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// pgvector accepts a bracketed, comma-separated string for a vector value.
function toVectorLiteral(vec: number[]): string {
  return `[${vec.join(",")}]`;
}

export async function ingestDocument(input: {
  title: string;
  sourceUrl?: string;
  text: string;
}): Promise<{ documentId: number; chunkCount: number }> {
  const chunks = chunkText(input.text);
  if (chunks.length === 0) {
    throw new Error("Document produced no chunks.");
  }

  const vectors = await embed(chunks.map((c) => c.content));

  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // node-postgres returns BIGINT (int8) values as strings, so convert explicitly.
    const docRes = await client.query<{ id: string }>(
      `INSERT INTO documents (title, source_url) VALUES ($1, $2) RETURNING id`,
      [input.title, input.sourceUrl ?? null],
    );
    const documentId = Number(docRes.rows[0].id);

    for (let i = 0; i < chunks.length; i++) {
      await client.query(
        `INSERT INTO chunks (document_id, chunk_index, content, embedding)
         VALUES ($1, $2, $3, $4)`,
        [documentId, chunks[i].index, chunks[i].content, toVectorLiteral(vectors[i])],
      );
    }

    await client.query("COMMIT");
    return { documentId, chunkCount: chunks.length };
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

The transaction is deliberate. A document with half its chunks written is worse than no document, because retrieval will confidently return the chunks that made it in and silently miss the rest. Wrapping the whole thing in BEGIN/COMMIT means a document is either fully ingested or not at all.
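The loop above issues one INSERT per chunk, which is fine for small documents but means one round trip per row. If ingest gets slow, a sketch like the following builds a single multi-row statement instead. The function and interface names are hypothetical. Postgres limits the number of bind parameters per statement (65,535 in the wire protocol), so at four parameters per row you'd still want to split very large documents into several statements.

```typescript
// Hypothetical helper: build one parameterized multi-row INSERT for the
// chunks table defined earlier, instead of one statement per chunk.
export interface ChunkRow {
  index: number;
  content: string;
}

export function buildChunkInsert(
  documentId: number,
  chunks: ChunkRow[],
  vectors: number[][],
): { text: string; values: unknown[] } {
  if (chunks.length === 0 || chunks.length !== vectors.length) {
    throw new Error("Need at least one chunk and exactly one vector per chunk.");
  }

  const values: unknown[] = [];
  const rows = chunks.map((chunk, i) => {
    const base = i * 4; // four placeholders per row
    // Same bracketed vector literal format that toVectorLiteral produces.
    values.push(documentId, chunk.index, chunk.content, `[${vectors[i].join(",")}]`);
    return `($${base + 1}, $${base + 2}, $${base + 3}, $${base + 4})`;
  });

  return {
    text:
      "INSERT INTO chunks (document_id, chunk_index, content, embedding) VALUES " +
      rows.join(", "),
    values,
  };
}
```

Inside the existing transaction, the for loop would become const q = buildChunkInsert(documentId, chunks, vectors); await client.query(q.text, q.values);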

You'd call this from a script, an admin route, or a background job. For a first run, a tiny script over a folder of text files is enough:

TypeScript

// scripts/ingest.ts
import { readFile, readdir } from "node:fs/promises";
import { join } from "node:path";
import { ingestDocument } from "../lib/ingest";

const DOCS_DIR = "./docs";

for (const file of await readdir(DOCS_DIR)) {
  if (!file.endsWith(".txt") && !file.endsWith(".md")) continue;
  const text = await readFile(join(DOCS_DIR, file), "utf8");
  const { chunkCount } = await ingestDocument({ title: file, text });
  console.log(`Ingested ${file}: ${chunkCount} chunks`);
}

Step 6: Retrieval

This is the query that does the actual work. We embed the question, then ask Postgres for the chunks whose embeddings are closest by cosine distance.

TypeScript

// lib/retrieve.ts
import { Pool } from "pg";
import { embed } from "./embed";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export interface RetrievedChunk {
  chunkId: number;
  documentId: number;
  title: string;
  content: string;
  distance: number;
}

export async function retrieve(query: string, k = 5): Promise<RetrievedChunk[]> {
  const [queryVec] = await embed([query]);
  const literal = `[${queryVec.join(",")}]`;

  const res = await pool.query<RetrievedChunk>(
    `SELECT
        c.id          AS "chunkId",
        c.document_id AS "documentId",
        d.title       AS "title",
        c.content     AS "content",
        c.embedding <=> $1 AS "distance"
     FROM chunks c
     JOIN documents d ON d.id = c.document_id
     ORDER BY c.embedding <=> $1
     LIMIT $2`,
    [literal, k],
  );

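  // node-postgres returns BIGINT (int8) ids as strings; convert them to match the interface.
  return res.rows.map((row) => ({
    ...row,
    chunkId: Number(row.chunkId),
    documentId: Number(row.documentId),
  }));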
}

The <=> operator is pgvector's cosine distance. Smaller means more similar: a distance of 0 means the two vectors point in the same direction, 1 means they're orthogonal, and 2 means they point in opposite directions. Cosine distance compares the direction of two vectors and ignores their magnitude, which is what you want for text similarity. (pgvector also has <-> for L2 distance and <#> for negative inner product. Use the one whose *_ops you indexed with. We built a vector_cosine_ops index, so we use <=>.)
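To make the number in the distance column concrete, here is what cosine distance computes, written out in plain TypeScript. This is an illustration of the formula (1 minus cosine similarity), not pgvector's implementation, and the function name is my own.

```typescript
// Cosine distance: 1 - (a . b) / (|a| * |b|).
// Returns NaN if either vector has zero length.
export function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension.");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineDistance([1, 0], [1, 0]);  // 0: same direction
cosineDistance([1, 0], [5, 0]);  // 0: magnitude is ignored
cosineDistance([1, 0], [0, 1]);  // 1: orthogonal
cosineDistance([1, 0], [-1, 0]); // 2: opposite direction
```

The second call is the important one for text: a longer vector pointing the same way counts as identical.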

ORDER BY ... LIMIT k is what lets Postgres use the ivfflat index. The whole reason the index exists is to answer "give me the k nearest" without scanning every row. k = 5 is a sane default; more chunks give Claude more to work with but cost more tokens and can dilute the signal with marginally-relevant text.

Step 7: Asking Claude, with citations

Now we hand the retrieved chunks to Claude and ask it to answer using only those chunks, and to cite them. The trick that makes citations reliable is to label each chunk with a number in the prompt and tell Claude to reference those numbers. Then we map the numbers back to real documents on our side.

TypeScript

// lib/answer.ts
import Anthropic from "@anthropic-ai/sdk";
import { retrieve, type RetrievedChunk } from "./retrieve";

const client = new Anthropic();

export interface Answer {
  text: string;
  sources: { ref: number; title: string; chunkId: number }[];
}

function buildContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map((c, i) => `[${i + 1}] (from "${c.title}")\n${c.content}`)
    .join("\n\n---\n\n");
}

export async function answerQuestion(question: string): Promise<Answer> {
  const chunks = await retrieve(question, 5);

  if (chunks.length === 0) {
    return { text: "I don't have any documents to answer from yet.", sources: [] };
  }

  const context = buildContext(chunks);

  const system =
    "You answer questions using only the numbered context provided. " +
    "Cite the sources you use with bracketed numbers like [1] or [2], placed " +
    "right after the claim they support. If the context does not contain the " +
    "answer, say so plainly instead of guessing.";

  const msg = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system,
    messages: [
      {
        role: "user",
        content: `Context:\n\n${context}\n\n---\n\nQuestion: ${question}`,
      },
    ],
  });

  const text = msg.content[0].type === "text" ? msg.content[0].text : "";

  const sources = chunks.map((c, i) => ({
    ref: i + 1,
    title: c.title,
    chunkId: c.chunkId,
  }));

  return { text, sources };
}

The system prompt is doing two jobs. "Use only the numbered context" is what keeps Claude grounded in your documents instead of answering from its own training. "If the context does not contain the answer, say so" is what stops it from inventing one when retrieval comes back empty-handed, which it will, sometimes, and a confident wrong answer is worse than "I don't know."

A few details that matter. The model id is claude-sonnet-4-6, the current Sonnet. max_tokens caps the response length. The response comes back as a list of content blocks, so we check msg.content[0].type === "text" before reading .text; blindly reading .text will bite you the day a non-text block shows up. And we return the source list alongside the answer so the UI can render the [1], [2] markers as real links back to documents.
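Note that answerQuestion returns every retrieved chunk as a source, whether or not Claude cited it. If you only want to show the sources the answer actually references, a small post-processing step can filter the list. This is a sketch with a hypothetical function name, and it assumes Claude follows the [n] format from the system prompt; a citation written as [1, 2] would not be picked up by this regex.

```typescript
export interface Source {
  ref: number;
  title: string;
  chunkId: number;
}

// Keep only the sources whose bracketed number appears in the answer text.
// Numbers the model invents (no matching source) are dropped automatically.
export function citedSources(text: string, sources: Source[]): Source[] {
  const cited = new Set<number>();
  for (const match of text.matchAll(/\[(\d+)\]/g)) {
    cited.add(Number(match[1]));
  }
  return sources.filter((s) => cited.has(s.ref));
}
```

In answerQuestion, you could return { text, sources: citedSources(text, sources) } instead of the full list.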

Step 8: The API route

Tie it together in a Next.js route handler.

TypeScript

// app/api/chat/route.ts
import { NextResponse } from "next/server";
import { answerQuestion } from "@/lib/answer";

export async function POST(req: Request) {
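  // Malformed or non-object JSON should produce a 400, not an unhandled exception.
  const body = await req.json().catch(() => null);
  const question = body?.question;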

  if (typeof question !== "string" || question.trim().length === 0) {
    return NextResponse.json({ error: "Missing question." }, { status: 400 });
  }

  const answer = await answerQuestion(question);
  return NextResponse.json(answer);
}

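Call it and you get back the answer text plus a sources array with one entry per retrieved chunk, which you can render however you like (the output below is trimmed to the two sources the answer cites):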

Shell

curl -s localhost:3000/api/chat \
  -H 'content-type: application/json' \
  -d '{"question":"What is the refund window?"}'

JSON

{
  "text": "The refund window is 30 days from purchase [1]. Refunds are issued to the original payment method [2].",
  "sources": [
    { "ref": 1, "title": "refund-policy.md", "chunkId": 412 },
    { "ref": 2, "title": "refund-policy.md", "chunkId": 413 }
  ]
}

That's a working RAG chatbot. Ingest your documents, hit the route, and you get answers with citations pointing back at the chunks they came from.

Gotchas

The happy path above works. Here's what I've learned the hard way about the parts that don't show up until later.

Chunking is where accuracy is won or lost

The chunker here splits on blank lines and that's fine for prose. But chunk size is a real tradeoff, not a default to ignore. Make chunks too large and each vector blurs across several topics, so retrieval gets fuzzy. Make them too small and a chunk loses the surrounding context that made it meaningful. If retrieval feels off, this is the first dial to turn, before you touch anything else. For structured content (code, tables, transcripts), split on the structure's real boundaries, not on blank lines.

Approximate indexes miss things

ivfflat is approximate. It only searches the buckets nearest your query, so occasionally the true nearest neighbor lives in a bucket it didn't open and your result is slightly worse than a brute-force scan would give. You can widen the search with SET ivfflat.probes = 10; (more buckets searched, slower, better recall). If recall really matters and your data fits, an hnsw index gives better recall at the cost of slower builds and more memory. And remember to build the index after loading data, because ivfflat learns its buckets from whatever rows exist at build time.
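One wrinkle with a connection pool: a plain SET applies to whichever connection happens to run it, which may not be the one that runs your search. A pattern that avoids this is SET LOCAL inside a transaction on a single client. The sketch below only builds the statement; the function name is hypothetical, and sqrt(lists) is the starting point the pgvector docs suggest for probes, not a tuned value.

```typescript
// Hypothetical helper: build a transaction-scoped probes setting,
// starting from the sqrt(lists) rule of thumb.
export function probesStatement(lists: number): string {
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return `SET LOCAL ivfflat.probes = ${probes}`;
}
```

With the lists = 100 index from Step 2 this yields SET LOCAL ivfflat.probes = 10. You would run it on a client checked out with pool.connect(), between BEGIN and the similarity query, then COMMIT so the setting disappears with the transaction.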

Cost adds up in two places

Every ingested chunk costs embedding tokens once at ingest, and you pay again to re-embed if you change models or chunking. Every question pays for the query embedding plus the Claude call, and the Claude call's input includes all k chunks you stuffed into the prompt. Raising k from 5 to 20 quadruples the context you're paying for on every single question. Embeddings are cheap; Claude input tokens at scale are not. Pick k deliberately.
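A back-of-the-envelope estimate makes the k tradeoff easier to reason about. This sketch assumes roughly four characters per token, which is a common rough heuristic for English prose and not an exact figure for any tokenizer, and it uses the chunker's 1200-character budget as the average chunk size. The function name is hypothetical.

```typescript
// Rough estimate of context tokens sent per question.
// charsPerToken = 4 is a heuristic assumption, not a tokenizer guarantee.
export function estimateContextTokens(
  k: number,
  avgChunkChars = 1200,
  charsPerToken = 4,
): number {
  return Math.ceil((k * avgChunkChars) / charsPerToken);
}

estimateContextTokens(5);  // 1500
estimateContextTokens(20); // 6000
```

Multiply by your question volume and your model's input price to get a monthly figure before you change k.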

Retrieval can come back empty or wrong, and the model will cover for you if you let it

The biggest failure mode in RAG isn't a crash. It's a confident answer built on chunks that don't actually contain the answer. There are two defenses. First, keep the system prompt instruction to say "I don't know" when the context doesn't cover the question, and test that it actually fires. Second, you have the cosine distances in hand; if the nearest chunk is past some distance threshold, you can treat retrieval as a miss and short-circuit before calling Claude at all. Don't trust that "documents exist" means "the answer is in them."
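The distance check can be as small as this. It's a sketch: the function name is hypothetical and the 0.6 threshold is a placeholder, since the right cutoff depends on your embedding model and content and should come from looking at real distances for questions you know are answerable and unanswerable.

```typescript
export interface ScoredChunk {
  distance: number;
}

// Placeholder threshold; tune it against your own data.
export const MAX_DISTANCE = 0.6;

// Treat retrieval as a miss when nothing came back or when even the
// nearest chunk is further away than the threshold.
export function isRetrievalMiss(
  chunks: ScoredChunk[],
  maxDistance = MAX_DISTANCE,
): boolean {
  if (chunks.length === 0) return true;
  const nearest = Math.min(...chunks.map((c) => c.distance));
  return nearest > maxDistance;
}
```

In answerQuestion, this would replace the chunks.length === 0 check, returning an "I couldn't find that in the documents" message without spending a Claude call.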

Keep your embeddings consistent

Documents and queries must be embedded by the same model. The day you upgrade the embedding model, every stored vector is from the old model and your new query vectors won't sit in the same space, so distances become noise. Re-embed everything when you change models, and keep the model name and dimension pinned in one place (the embed.ts constants) so the migration is one obvious edit, not a hunt.
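A cheap guard is to check what the embedding call returned against the pinned dimension before anything reaches Postgres. This is a sketch with a hypothetical function name, and it only catches model swaps that change the dimension; two different models with the same dimension would pass, so it complements re-embedding rather than replacing it.

```typescript
// Fail fast if the embedding response doesn't match what the schema expects.
export function assertEmbeddingShape(
  vectors: number[][],
  expectedCount: number,
  expectedDim: number,
): void {
  if (vectors.length !== expectedCount) {
    throw new Error(`Expected ${expectedCount} vectors, got ${vectors.length}.`);
  }
  vectors.forEach((vec, i) => {
    if (vec.length !== expectedDim) {
      throw new Error(
        `Vector ${i} has ${vec.length} dimensions, expected ${expectedDim}.`,
      );
    }
  });
}
```

In embed(), you could call assertEmbeddingShape(vectors, texts.length, EMBEDDING_DIM) before returning, so a mismatch fails with a clear message instead of a Postgres insert error.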

Wrapping up

The core of RAG is smaller than the ecosystem around it suggests: chunk, embed, store, retrieve by distance, answer with the retrieved text as context. Postgres and pgvector cover the storage and search without a separate service, and Claude handles the answer with citations falling out of a numbered-context prompt.

Where I'd go next: stream the response so the answer renders as it's written, add metadata filters to the retrieval query (date, source, author) so you can scope which documents are searchable, and put a real eval harness in front of it (a set of questions with known-good answers) so you can change the chunker or k and actually measure whether retrieval got better instead of guessing. That eval harness is the difference between tuning RAG and flailing at it.
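The smallest useful version of that harness measures retrieval only: for each test question, did the document you expected show up in the top k results? Here is a sketch of that metric. The types and function name are hypothetical, and it assumes you've already run retrieve() for each question and recorded the document ids in ranked order.

```typescript
export interface EvalCase {
  question: string;
  expectedDocumentId: number;
  retrievedDocumentIds: number[]; // ranked, nearest first
}

// Fraction of cases where the expected document appears in the top k results.
export function hitRateAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrievedDocumentIds.slice(0, k).includes(c.expectedDocumentId),
  ).length;
  return hits / cases.length;
}
```

Run it before and after a chunker change with the same questions; if the hit rate drops, the change made retrieval worse regardless of how the answers read.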