Tutorial // Document AI | 2026-06-20 | 13 min read

Build a PDF Data Extraction Pipeline with an LLM

Turn messy PDFs into clean, validated JSON with Python and Claude: parsing, structured prompts, Pydantic validation, and confidence checks.

Varun Raj Manoharan, Founder & Principal Engineer

Document AI, Python, Claude, Data Extraction, Tutorial

Key takeaways

Split the work so the LLM reads messy PDF text while a Pydantic schema validates every field before you trust any value.
Use Decimal rather than float for money so summing line items against a total does not surface rounding errors on correct invoices.
Generate the JSON schema from the Pydantic model into the prompt so there is no drift between what you ask for and what you validate.
Ask the model to rate per-field confidence and re-verify low-confidence fields with a source quote to find the small share needing human review.

PDFs are where structured data goes to die. Someone built an invoice in a layout tool, exported it, and now you have a file that looks tidy on screen but is a tangle of absolutely-positioned text fragments underneath. Two invoices from the same vendor can serialize their text in a completely different order. Regex and template matching get you maybe 70% of the way there, and then a vendor changes their template and your pipeline breaks at 2am.

I've built a few of these pipelines now, and the approach that has held up is to stop fighting the layout. Pull the raw text out of the PDF, hand it to an LLM with a strict schema, and validate whatever comes back against that schema before you trust a single field. The model is good at reading messy text; Pydantic is good at refusing to let bad data through. That division of labor is the whole trick.

In this tutorial we'll build an invoice extractor that produces validated JSON. By the end you'll have something that takes a PDF and gives you back a typed object with a vendor name, line items, totals, and a per-field confidence flag so you know which values to double-check. The same structure works for purchase orders, lab reports, contracts, intake forms, anything with a predictable set of fields.

What you'll need

Python 3.10 or later.
An Anthropic API key, set as ANTHROPIC_API_KEY in your environment.
A few sample PDFs to test against. Use real ones, not clean synthetic ones; you want the warts.
These packages:

Shell

pip install anthropic pypdf pydantic

We'll use pypdf for text extraction, the anthropic SDK to call Claude, and Pydantic v2 for the schema and validation. If you're dealing with scanned documents you'll also want pytesseract, pdf2image, and the Tesseract binary, but hold off on that until the Gotchas section. Most digital PDFs don't need OCR, and reaching for it first is a common waste of time.

Step 1: Extract the text

First, get the words out of the PDF. pypdf walks each page and pulls the embedded text layer.

Python

from pypdf import PdfReader


def extract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    pages = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        pages.append(f"--- Page {i + 1} ---\n{text}")
    return "\n\n".join(pages)

A couple of things worth noticing. I prefix each page with a marker. That sounds trivial, but it helps the model when an invoice spans pages: it can tell that a line item on page 2 belongs to the same document, and you'll thank yourself later when you ask it to cite where a value came from. The or "" guard is cheap insurance. As far as I can tell, current pypdf releases return an empty string from extract_text() when a page has no extractable text, but if a different version or wrapper ever hands back None, the guard stops the literal word "None" from being formatted into your prompt. Either way, a page that returns nothing is your first signal that you might be looking at a scan rather than a real text layer.

Run this on a sample and actually read the output. It will be ugly: fields out of order, labels separated from their values, whitespace doing strange things. That's fine. We're not parsing it ourselves. We just need the words present and roughly in reading order.
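If you'd rather not eyeball every file, a rough check for near-empty pages can flag likely scans early. This is a minimal sketch, not part of the pipeline above: sparse_pages is a hypothetical helper, the 20-character threshold is an arbitrary assumption, and it relies on the "--- Page N ---" markers that extract_text writes.

```python
import re

# Matches the "--- Page N ---" markers written by extract_text in Step 1.
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)


def sparse_pages(text: str, min_chars: int = 20) -> list[int]:
    """Return page numbers whose text has fewer than min_chars non-space characters."""
    parts = PAGE_MARKER.split(text)
    # With one capture group, split() yields [preamble, number, body, number, body, ...].
    sparse = []
    for number, body in zip(parts[1::2], parts[2::2]):
        if len("".join(body.split())) < min_chars:
            sparse.append(int(number))
    return sparse


sample = "--- Page 1 ---\nInvoice 1042 from Acme Corp, total 1299.00\n\n--- Page 2 ---\n"
print(sparse_pages(sample))  # [2]
```

A non-empty result doesn't prove the page is a scan (it might just be a blank back page), but it tells you which pages to open first.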

Step 2: Define the schema with Pydantic

Before we write a single line of prompt, we decide what "extracted" means. This is the part people skip, and it's the part that makes the whole thing reliable. The schema is the contract: it's what we tell the model to produce, and it's what we validate against afterward.

Python

from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str
    quantity: Decimal
    unit_price: Decimal
    amount: Decimal


class Invoice(BaseModel):
    vendor_name: str = Field(description="Company that issued the invoice")
    invoice_number: str
    invoice_date: Optional[date] = Field(
        default=None, description="Date the invoice was issued, ISO 8601"
    )
    line_items: list[LineItem] = Field(default_factory=list)
    subtotal: Optional[Decimal] = None
    tax: Optional[Decimal] = None
    total: Decimal

I'm using Decimal for money rather than float, and I'd push back hard on anyone who wants to use float here. Floating point can't represent 19.99 exactly, and the moment you start summing line items to check a total, the rounding errors will make a correct invoice look wrong. Decimal keeps the arithmetic honest.
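To see the problem concretely, here is a small standalone comparison using three made-up line items of 0.10 each against a total of 0.30. The amounts are illustrative, not taken from a real invoice.

```python
from decimal import Decimal

# Three line items of 0.10 each; the invoice total is 0.30.
float_sum = sum([0.10, 0.10, 0.10])
decimal_sum = sum([Decimal("0.10"), Decimal("0.10"), Decimal("0.10")])

print(float_sum == 0.30)               # False: the float sum is 0.30000000000000004
print(decimal_sum == Decimal("0.30"))  # True

# Build Decimals from strings. Decimal(0.10) would copy the float's error.
print(Decimal(0.10) == Decimal("0.10"))  # False
```

The last line is the trap to watch for: Decimal only helps if the value never passes through a float on the way in.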

Notice which fields are optional and which aren't. vendor_name, invoice_number, and total have no default: if the model can't find them, validation should fail loudly, because an invoice without a total isn't an invoice. But invoice_date, subtotal, and tax are genuinely sometimes absent (cash invoices, tax-exempt vendors), so they're Optional. Getting this distinction right is most of the battle. Mark something required that's legitimately missing and you'll get false failures; mark something optional that should always be there and bad extractions sail through.

The Field descriptions aren't decoration. We're going to feed the JSON schema generated from this model straight into the prompt, and those descriptions become instructions to the model. "Date the invoice was issued, ISO 8601" tells Claude both what the field means and what format you want.
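You can check what the model will actually be shown by inspecting the generated schema. The sketch below uses MiniInvoice, a trimmed-down stand-in I made up for illustration (it is not the full Invoice model), and assumes Pydantic v2.

```python
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field


class MiniInvoice(BaseModel):
    # Hypothetical, shortened version of the Invoice model above.
    vendor_name: str = Field(description="Company that issued the invoice")
    invoice_date: Optional[date] = Field(
        default=None, description="Date the invoice was issued, ISO 8601"
    )
    total: Decimal


schema = MiniInvoice.model_json_schema()

# Fields without a default end up in "required"; Field descriptions are carried over.
required = schema["required"]
date_description = schema["properties"]["invoice_date"]["description"]
print(required)          # ['vendor_name', 'total']
print(date_description)  # Date the invoice was issued, ISO 8601
```

If a description reads badly here, it will read badly to the model too, so this is a cheap place to review your wording.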

Step 3: Build the extraction prompt

Now we write the prompt that turns text into structured output. The schema does the heavy lifting, so the prompt itself can stay short and direct.

Python

import json

PROMPT_TEMPLATE = """You are extracting structured data from an invoice.

Below is the raw text extracted from a PDF. The layout may be jumbled
because PDF text extraction doesn't always preserve reading order.

Return a single JSON object that conforms exactly to this schema:

{schema}

Rules:
- Use only information present in the text. Do not invent values.
- If a field isn't present, use null (for optional fields).
- Amounts are numbers without currency symbols or thousands separators.
- Dates are ISO 8601 (YYYY-MM-DD).
- Respond with the JSON object only. No prose, no markdown fences.

Invoice text:
{text}"""


def build_prompt(text: str) -> str:
    schema = json.dumps(Invoice.model_json_schema(), indent=2)
    return PROMPT_TEMPLATE.format(schema=schema, text=text)

Invoice.model_json_schema() generates a JSON Schema document from the Pydantic model: types, required fields, descriptions, the whole structure. Embedding it in the prompt means the schema lives in exactly one place. When you add a field to the model, the prompt updates itself. No drift between "what I asked for" and "what I validate against," which is a class of bug that's miserable to debug.

The rules are deliberately about the failure modes I've actually seen. "Do not invent values" fights the model's instinct to be helpful and fill gaps. The bit about currency symbols and separators heads off "$1,299.00" strings that won't parse as Decimal. And asking for raw JSON with no markdown fences saves a fragile post-processing step, though we'll handle stray fences anyway in Step 5, because models don't always listen.
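If formatted amounts still slip through despite the rule, one option is a small normalizer you apply to suspect values before validation. This is a sketch under assumptions, not something the pipeline above does: normalize_amount is a hypothetical helper, and it assumes "." is the decimal separator and "," groups thousands, which is wrong for invoices written as 1.299,00.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Strip currency symbols and thousands separators; return None if nothing parses."""
    # Keep only digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(normalize_amount("$1,299.00"))  # 1299.00
print(normalize_amount("n/a"))        # None
```

I'd still treat the prompt rule as the primary fix and reach for something like this only when the same formatting keeps coming back.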

Step 4: Call Claude

Here's the API call. The Anthropic SDK reads your key from the environment, so the client takes no arguments.

Python

import anthropic

client = anthropic.Anthropic()


def call_model(prompt: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

That's the entire call. The response text lives at message.content[0].text: content is a list of blocks, and for a plain text reply the first block is the one you want.

A note on max_tokens: 2048 is comfortable for a typical invoice, but a multi-page document with thirty line items can produce a lot of JSON. If you see output that's valid JSON right up until it gets truncated mid-object, that's the token limit biting, not a model failure. Bump it up. You can check message.stop_reason: if it comes back as "max_tokens", the response was cut off and you should retry with more room rather than try to parse a half-finished object.
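That retry can be sketched as a small wrapper. Everything here beyond stop_reason and content[0].text is my own assumption: call_with_room is a hypothetical helper, the doubling strategy and the 16384 ceiling are arbitrary, and the create argument is injected so the example runs without an API key (fake_create stands in for the real client).

```python
from types import SimpleNamespace
from typing import Callable


def call_with_room(
    create: Callable[..., object],
    prompt: str,
    model: str,
    max_tokens: int = 2048,
    ceiling: int = 16384,
) -> str:
    """Call the model, doubling max_tokens while the reply comes back cut off."""
    while True:
        message = create(
            model=model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        if message.stop_reason != "max_tokens":
            return message.content[0].text
        if max_tokens >= ceiling:
            raise RuntimeError(f"Reply still truncated at max_tokens={max_tokens}")
        max_tokens = min(max_tokens * 2, ceiling)


def fake_create(**kwargs):
    # Stand-in for the real API: pretends anything under 4096 tokens gets truncated.
    truncated = kwargs["max_tokens"] < 4096
    return SimpleNamespace(
        stop_reason="max_tokens" if truncated else "end_turn",
        content=[SimpleNamespace(text='{"total": 1}')],
    )


print(call_with_room(fake_create, "prompt", model="example-model"))  # {"total": 1}
```

In the real pipeline you would pass client.messages.create as the create argument. Keep the ceiling below whatever output limit your chosen model supports.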

I'm keeping this as a single non-streaming call because invoice JSON is small and fast. If you were extracting from a 200-page report you'd want to stream, but that's not this problem.

Step 5: Parse and validate

This is where Pydantic earns its place. The model returns a string that claims to be JSON matching our schema. We don't take that on faith.

Python

from pydantic import ValidationError


def parse_invoice(raw: str) -> Invoice:
    cleaned = raw.strip()
    # Models sometimes wrap JSON in markdown fences despite instructions.
    if cleaned.startswith("```"):
        cleaned = cleaned.split("```")[1]
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
        cleaned = cleaned.strip()

    return Invoice.model_validate_json(cleaned)

model_validate_json does two jobs at once: it parses the JSON string and validates the result against the Invoice model, coercing types as it goes. "2026-06-20" becomes a date object. "1299.00" becomes a Decimal. If the model returned a non-numeric string such as "$1,299.00" where a number belongs, or left out a required field, or produced a date that isn't a real date, you get a ValidationError with a precise description of what's wrong and where.

That fence-stripping block is defensive cruft, and I'm not happy it's there, but in practice models occasionally wrap output in ```json despite being told not to. Stripping it is cheaper than a retry. If you find your model never does this, delete the block; don't carry code you don't need.

The reason this step is non-negotiable: an LLM will confidently return data that's the wrong shape. Not often, but often enough that "it worked on my five test files" is not a guarantee. Validation converts "the model lied to me" from a silent data-corruption bug (the worst kind, the kind you discover three weeks later in a report) into a loud exception you handle right now.

Python

def extract_invoice(pdf_path: str) -> Invoice:
    text = extract_text(pdf_path)
    prompt = build_prompt(text)
    raw = call_model(prompt)
    return parse_invoice(raw)

Four functions, one pipeline. PDF in, validated Invoice out, or an exception if something's wrong.

Step 6: Confidence and a retry for the shaky fields

Validation tells you the data is well-formed. It doesn't tell you it's correct. A model can return a perfectly typed Decimal for the total that happens to be wrong because the text was garbled. So the last piece is asking the model to flag its own uncertainty, and re-checking the fields it's unsure about.

I add a parallel confidence map to the extraction. Rather than bolt it onto the Invoice model, I ask for it as a separate object so the clean data stays clean.

Python

from typing import Literal

Confidence = Literal["high", "medium", "low"]


class ExtractionResult(BaseModel):
    invoice: Invoice
    confidence: dict[str, Confidence] = Field(default_factory=dict)

Then I extend the prompt to ask for the confidence map alongside the data: "for each top-level field, rate your confidence as high, medium, or low based on how clearly the value appears in the text." Fields the model had to guess at, or reconstruct from a jumbled total line, come back as low.
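One way that extended prompt might look is sketched below. The wording of CONFIDENCE_RULES and the build_confidence_prompt helper are my own illustration rather than a fixed recipe; the two top-level keys are chosen to match the ExtractionResult model.

```python
import json

# Hypothetical wording; adjust to taste. The keys mirror ExtractionResult.
CONFIDENCE_RULES = """Return a single JSON object with two keys:
- "invoice": an object that conforms exactly to the schema below.
- "confidence": an object mapping each top-level invoice field name to
  "high", "medium", or "low", based on how clearly the value appears in the text.

Use only information present in the text. Respond with the JSON object only."""


def build_confidence_prompt(invoice_schema: dict, text: str) -> str:
    """Assemble a prompt that asks for the data plus a per-field confidence map."""
    schema = json.dumps(invoice_schema, indent=2)
    return (
        "You are extracting structured data from an invoice.\n\n"
        f"{CONFIDENCE_RULES}\n\n"
        f"Schema:\n{schema}\n\n"
        f"Invoice text:\n{text}"
    )


print(build_confidence_prompt({"type": "object"}, "Total 5")[:60])
```

In the pipeline you would pass Invoice.model_json_schema() as invoice_schema and validate the reply with ExtractionResult.model_validate_json, so an unexpected confidence label fails the same way a bad field does.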

The useful move is to act on that. Collect the low-confidence fields and ask the model to look again, this time focused only on those:

Python

def low_confidence_fields(result: ExtractionResult) -> list[str]:
    return [field for field, level in result.confidence.items() if level == "low"]


def reverify(text: str, fields: list[str]) -> dict:
    # Ask for the quote inside the JSON so it survives parsing and can be audited.
    prompt = (
        f"From the invoice text below, carefully re-read and report only "
        f"these fields: {', '.join(fields)}. "
        "Quote the exact line each value comes from, then give the value. "
        'Respond as JSON mapping each field name to an object with "quote" '
        'and "value" keys.\n\n'
        f"Invoice text:\n{text}"
    )
    raw = call_model(prompt)
    # parse_inner_json strips any markdown fence before json.loads sees the text.
    return json.loads(parse_inner_json(raw))

Asking it to quote the source line is the part that matters. It forces the model to ground each value in the actual text instead of restating its earlier guess, and it gives you something auditable. If the re-verified value disagrees with the first pass, that field is exactly the one a human should glance at before the data goes anywhere important. You're not trying to reach 100% automated accuracy; you're trying to know precisely which 3% of fields to look at, which is a far more achievable and honestly more useful goal.

(parse_inner_json here is just the fence-stripping logic from Step 5 factored out so both call sites share it.)
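For completeness, a minimal version of that helper could look like this. It mirrors the Step 5 logic; the FENCE constant is just my way of writing three backticks without putting a literal fence inside the example.

```python
FENCE = "`" * 3  # three backticks


def parse_inner_json(raw: str) -> str:
    """Return the JSON text from a reply, dropping a surrounding markdown fence if present."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Take the text between the first pair of fences, minus an optional "json" tag.
        cleaned = cleaned.split(FENCE)[1]
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
        cleaned = cleaned.strip()
    return cleaned


fenced = FENCE + 'json\n{"total": "1299.00"}\n' + FENCE
print(parse_inner_json(fenced))  # {"total": "1299.00"}
```

With this in place, parse_invoice can shrink to a single call: Invoice.model_validate_json(parse_inner_json(raw)).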

For a lot of use cases you can stop at Step 5. The confidence pass is worth adding when the cost of a silently wrong value is high (paying the wrong amount, filing the wrong number) and cheap to skip when you're doing something forgiving like search indexing.
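When you do run the confidence pass, the comparison between the two passes can be a few lines. This sketch assumes the re-verification reply maps each field to an object with "quote" and "value" keys; fields_for_review is a hypothetical helper, and the sample data is invented.

```python
def fields_for_review(first_pass: dict, second_pass: dict, text: str) -> list[str]:
    """Flag fields whose re-verified value differs or whose quote is not in the text."""
    flagged = []
    for field, recheck in second_pass.items():
        quote = recheck.get("quote") or ""
        quoted_in_text = bool(quote) and quote in text
        # Compare as strings so "1299.00" and Decimal("1299.00") count as equal.
        same_value = str(first_pass.get(field)) == str(recheck.get("value"))
        if not (quoted_in_text and same_value):
            flagged.append(field)
    return flagged


text = "Invoice No: 1042\nTotal due 1299.00"
first = {"invoice_number": "1042", "total": "1299.00"}
second = {
    "invoice_number": {"quote": "Invoice No: 1042", "value": "1042"},
    "total": {"quote": "Total due 1,299.00", "value": "1299.00"},
}
print(fields_for_review(first, second, text))  # ['total']
```

The check is deliberately strict: an exact substring match on the quote and a string comparison on the value will sometimes flag a field that is actually fine. For a human-review queue, that's the direction you want to err in.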

Gotchas

A few things that will bite you, roughly in the order they'll bite.

Scanned PDFs have no text to extract. If extract_text returns empty or near-empty strings, the PDF is probably an image (a scan or a photo) with no text layer. pypdf can't help here; there's nothing for it to read. You need OCR. Render each page to an image and run it through Tesseract:

Python

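# Extra dependencies: pip install pytesseract pdf2image,
# plus the Tesseract and Poppler binaries (pdf2image uses Poppler to render pages).
import pytesseract
from pdf2image import convert_from_path


def ocr_text(pdf_path: str) -> str:
    # Render each page to an image at 300 DPI, then OCR the pages in order.
    images = convert_from_path(pdf_path, dpi=300)
    return "\n\n".join(pytesseract.image_to_string(img) for img in images)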

OCR output is noisier than a real text layer; expect the occasional 1 read as l, or 0 as O. The validation and confidence steps matter even more here, because the model is now reading text that itself contains errors. A cleaner alternative for scans is to skip the text layer entirely and send the page images to Claude directly using its vision support, but that's a different pipeline and a longer post.
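To avoid paying for OCR on every file, you can try the text layer first and fall back only when it comes up nearly empty. A rough sketch: text_with_ocr_fallback is a hypothetical helper, the 50-character threshold is a guess you should tune, and the two functions are passed in so the example runs without any PDF libraries installed.

```python
from typing import Callable


def text_with_ocr_fallback(
    pdf_path: str,
    extract: Callable[[str], str],
    ocr: Callable[[str], str],
    min_chars: int = 50,
) -> tuple[str, str]:
    """Return (text, source), using OCR only when the text layer is nearly empty."""
    text = extract(pdf_path)
    # Ignore the "--- Page N ---" marker lines when measuring how much real text there is.
    body = "".join(
        line for line in text.splitlines() if not line.startswith("--- Page ")
    )
    if len("".join(body.split())) >= min_chars:
        return text, "text_layer"
    return ocr(pdf_path), "ocr"


# Stand-ins for extract_text and ocr_text, so the example is self-contained.
empty_layer = lambda path: "--- Page 1 ---\n"
fake_ocr = lambda path: "Invoice 1042 Total 1299.00"
print(text_with_ocr_fallback("scan.pdf", empty_layer, fake_ocr))
```

In the pipeline you would pass extract_text and ocr_text. Keeping the source label around is useful later: fields extracted from OCR text deserve a lower bar for human review.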

The model will hallucinate fields under pressure. If the text genuinely doesn't contain an invoice number and your prompt nudges hard enough, the model may produce a plausible-looking one. The defenses are layered: "use only information present" in the prompt, the schema allowing null for things that can be absent, and the re-verification step that demands a source quote. A "field" the model can't quote a line for is a field it invented.

Long documents hit token limits. Both the input (a huge PDF stuffed into one prompt) and the output (lots of line items) can exceed limits. For input, you can extract page by page and merge, or summarize boilerplate before sending. For output, watch stop_reason for "max_tokens" and raise max_tokens rather than parsing a truncated object. Half-valid JSON is worse than a clean error.
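For the input side, one simple approach is to group whole pages into chunks under a size budget, using the page markers from Step 1. The sketch below is an assumption-laden starting point: split_pages and chunk_pages are hypothetical helpers, and the budget is in characters as a crude proxy for tokens.

```python
def split_pages(text: str) -> list[str]:
    """Split extract_text output back into pages at the '--- Page N ---' markers."""
    pages, current = [], []
    for line in text.splitlines():
        if line.startswith("--- Page ") and current:
            pages.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        pages.append("\n".join(current))
    return pages


def chunk_pages(text: str, max_chars: int = 20000) -> list[str]:
    """Group whole pages into chunks of at most max_chars, never splitting a page."""
    chunks, current = [], ""
    for page in split_pages(text):
        candidate = f"{current}\n\n{page}" if current else page
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = page
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "--- Page 1 ---\naaaa\n\n--- Page 2 ---\nbbbb\n\n--- Page 3 ---\ncccc"
print(len(chunk_pages(doc, max_chars=45)))  # 2
```

A single page longer than the budget still becomes its own oversized chunk. Merging the per-chunk results is left out here, and it needs care: header fields usually appear once, while line items have to be concatenated across chunks.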

Validation failures are information, not just errors. When ValidationError fires, don't just retry blindly. Read it. "total: Input should be a valid decimal" usually means the text had "$1,299" and a prompt rule didn't catch it, so tighten the prompt. "invoice_number: Field required" on documents that genuinely lack one means your schema is too strict, so make it Optional. The exceptions are telling you where your model of the data and the actual data disagree, and that's worth listening to.
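If you're processing a batch, it helps to sort failures by likely fix instead of reading each traceback. The sketch below works on the list of dicts that ValidationError.errors() returns in Pydantic v2 (each with "type", "loc", and "msg"). The triage function and its bucket names are hypothetical, and the sample errors are handwritten so the example runs on its own.

```python
def triage(errors: list[dict]) -> dict[str, list[str]]:
    """Group validation errors by likely fix, based on each error's type."""
    buckets = {"tighten_prompt": [], "review_schema": [], "other": []}
    for error in errors:
        field = ".".join(str(part) for part in error["loc"])
        if error["type"] == "missing":
            # A required field was absent: maybe it should be Optional.
            buckets["review_schema"].append(field)
        elif error["type"].endswith("_parsing"):
            # A value was present but in the wrong format: a prompt rule may help.
            buckets["tighten_prompt"].append(field)
        else:
            buckets["other"].append(field)
    return buckets


sample_errors = [
    {"type": "decimal_parsing", "loc": ("total",), "msg": "Input should be a valid decimal"},
    {"type": "missing", "loc": ("invoice_number",), "msg": "Field required"},
    {"type": "decimal_parsing", "loc": ("line_items", 0, "amount"), "msg": "Input should be a valid decimal"},
]
print(triage(sample_errors))
```

In the pipeline you would call triage(exc.errors()) inside an except ValidationError as exc block. The buckets are only a first guess at the cause, so the advice above still holds: read the errors before changing anything.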

Wrapping up

The shape of this pipeline is the lesson: extract text, define a strict schema, let the model fill it, validate hard, and flag uncertainty. The LLM handles the reading, which it's good at. Pydantic handles the trust, which the LLM is bad at. Keeping those two jobs separate is what turns a demo that works on five files into something you'd actually run against a thousand.

Swap the Invoice model for whatever you're extracting and most of this carries over unchanged. The text extraction, the schema-in-the-prompt trick, the validation gate, the confidence re-check: none of it is invoice-specific. That's the part worth keeping.