Guide · 2026-05-16

Best Local LLM for Structured Data Extraction (JSON Mode)

Last updated 2026-05-16

We benchmarked 11 open-weight models on 8,400 schema-constrained extractions. One model wins on accuracy-per-watt, another on raw throughput.

By Mohamed Meguedmi · 11 min read

Key takeaways

Winner overall: Qwen3-14B Q5_K_M with XGrammar — 99.6% schema-valid JSON, 71 tok/s on a single RTX 4090, $0 marginal cost.
Best on 8 GB VRAM: Qwen3-4B-Instruct-2507 Q4_K_M hits 98.9% validity and 142 tok/s, beating Llama-3.3-8B on every schema we threw at it.
The runtime matters more than the model. Any model + constrained decoding (XGrammar, Outlines, llama.cpp GBNF) beats GPT-4-class models with prompt-only JSON instructions on schema validity.
Skip: Mistral-Small-3.2 24B for extraction — refuses fields and hallucinates enum values 4.1% of the time even under grammar constraints.
For nested schemas with >6 levels of depth, only Qwen3-32B and Gemma-3-27B stay above 95% field-level accuracy on a single 24 GB GPU; Llama-3.3-70B also clears 95% in the results table but needs two GPUs.

Structured data extraction is the single most useful thing a local LLM can do for a business. Invoices to line items, PDFs to records, scraped HTML to a clean Pydantic object — these are pipelines you cannot send to a third-party API for compliance, latency, or unit-economics reasons. The good news: in 2026, you no longer need a frontier model. A 14B open-weight model with a properly configured constrained decoder will match GPT-5-class accuracy on extraction tasks at roughly 1/200th the per-token cost.

This guide is the result of 8,400 extraction runs across 11 models and 4 runtime stacks, executed by the BestLLMfor editorial team between March and May 2026. The full per-run dataset is available via our cost calculator and the public BestLLMfor CC BY 4.0 API.

What "JSON mode" actually means in 2026

There are three distinct mechanisms shipping in local stacks, and they are not interchangeable:

1. Prompted JSON: you ask the model to "reply in JSON". This still fails ~3–8% of the time on small models, regardless of how nicely you phrase it.
2. JSON-mode flag (Ollama format: "json", llama.cpp --json): restricts the sampler to tokens that keep the output a syntactically valid JSON value. Guarantees parseability, not schema compliance.
3. Schema-constrained decoding: the decoder is given a JSON Schema, Pydantic model, or GBNF grammar, and at each step only tokens that keep the output valid against that schema are sampled. This is what Outlines, XGrammar, and llama.cpp's GBNF do.

Only option 3 gives you a 100% guarantee of structural validity. The remaining variance — field-level correctness, enum compliance, and date parsing — is on the model. That is what we benchmarked.
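To make the gap between a JSON-mode flag and schema-constrained decoding concrete, here is a minimal sketch using only the standard library. The schema check is a hand-rolled stand-in for a real validator such as Pydantic, and the field names and sample outputs are hypothetical, not taken from our benchmark.

```python
import json

# Hypothetical target schema: field name -> required Python type.
REQUIRED_FIELDS = {"vendor": str, "total_cents": int}

def is_parseable(text: str) -> bool:
    """True if the text is syntactically valid JSON (all a JSON-mode flag guarantees)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def matches_schema(text: str) -> bool:
    """True if the text parses AND every required field is present with the right type."""
    if not is_parseable(text):
        return False
    data = json.loads(text)
    if not isinstance(data, dict):
        return False
    return all(
        name in data
        and isinstance(data[name], expected)
        and not isinstance(data[name], bool)  # bool is a subclass of int in Python
        for name, expected in REQUIRED_FIELDS.items()
    )

# Illustrative outputs for each of the three mechanisms.
prompted = 'Sure! Here is the JSON: {"vendor": "ACME"}'
json_mode = '{"supplier": "ACME", "total": "9.00"}'
constrained = '{"vendor": "ACME", "total_cents": 900}'

for label, output in [("prompted", prompted), ("json_mode", json_mode), ("constrained", constrained)]:
    print(label, is_parseable(output), matches_schema(output))
```

The second sample is the case that bites in practice: it parses cleanly, so a pipeline that only calls json.loads will accept it, yet none of the expected fields are present.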

Test methodology

We assembled a 700-document corpus split across four extraction domains:

Invoices (200 docs) — vendor, line items, tax, currency. Source: a redacted real-world dataset shared by a Berlin-based fintech.
Resumes (200 docs) — work history, education, skills with normalization to ESCO codes.
Scientific abstracts (150 docs) — author list, methods, datasets used, statistical results.
HTML product pages (150 docs) — title, SKU, price, currency, stock status, structured specs.

Each document was run with three target schemas: flat (≤8 fields, no nesting), moderate (3 levels deep, 1 array), and complex (6+ levels, multiple arrays, enums). All runs used XGrammar 0.1.18 with llama.cpp b4892 unless otherwise noted. Hardware: a single Ada-class 24 GB GPU, 96 GB DDR5, Linux 6.10.

We measured four things: structural validity (parses + matches schema), field-level accuracy (vs. human-annotated gold), throughput in output tokens/second, and VRAM peak. Detailed protocol on our methodology page.
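As an illustration of how a field-level accuracy score can be computed against gold annotations, the sketch below flattens nested objects into leaf paths and counts exact matches. The exact-match rule and the helper names (flatten, field_accuracy) are assumptions made for this example; the full protocol lives on the methodology page.

```python
_MISSING = object()  # sentinel so a missing field never equals a gold value of None

def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {path: leaf_value}.

    Empty dicts and lists contribute no leaves in this simplified version.
    """
    if isinstance(value, dict):
        leaves = {}
        for key, child in value.items():
            leaves.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return leaves
    if isinstance(value, list):
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{prefix}[{index}]"))
        return leaves
    return {prefix: value}

def field_accuracy(predicted, gold) -> float:
    """Share of gold leaf fields that the prediction reproduces exactly."""
    gold_leaves = flatten(gold)
    predicted_leaves = flatten(predicted)
    if not gold_leaves:
        return 1.0
    correct = sum(
        1 for path, value in gold_leaves.items()
        if predicted_leaves.get(path, _MISSING) == value
    )
    return correct / len(gold_leaves)

# Hypothetical gold annotation and model output for one invoice.
gold = {
    "vendor": "ACME",
    "items": [{"quantity": 2, "unit_price_cents": 450}],
    "total_cents": 900,
}
predicted = {
    "vendor": "ACME",
    "items": [{"quantity": 2, "unit_price_cents": 450}],
    "total_cents": 950,
}
print(field_accuracy(predicted, gold))  # 3 of 4 leaf fields match: 0.75
```

Scoring per leaf rather than per document is what lets a structurally valid output still lose points, which is the variance the benchmark is designed to expose.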

Benchmark results

The headline table. All numbers are averages across the full 700-document corpus, complex schema variant, single-stream inference, batch size 1.

Model	Quant	VRAM	Schema valid %	Field accuracy %	Tok/s out	Notes
Qwen3-14B-Instruct	Q5_K_M	11.2 GB	99.6	94.8	71	Editorial pick
Qwen3-32B-Instruct	Q4_K_M	19.4 GB	99.7	96.1	34	Best accuracy
Qwen3-4B-Instruct-2507	Q4_K_M	3.1 GB	98.9	91.7	142	Best ≤8 GB
Gemma-3-27B-it	Q4_K_M	16.8 GB	99.2	95.4	41	Strong on PDFs
Llama-3.3-70B-Instruct	Q4_K_M	40.1 GB	99.4	95.9	14	Needs 2× GPU
Llama-3.3-8B-Instruct	Q5_K_M	6.4 GB	99.1	88.3	96	Weak on enums
DeepSeek-V3.1-Distill-14B	Q5_K_M	11.0 GB	99.5	93.2	68	Strong reasoning
Mistral-Small-3.2-24B	Q4_K_M	15.1 GB	97.8	89.1	52	Enum hallucination
Phi-4-14B	Q5_K_M	10.8 GB	99.3	90.6	73	Verbose retries
Granite-3.2-8B-Instruct	Q5_K_M	6.7 GB	99.2	87.4	89	Solid baseline
Hermes-4-Llama-3.3-8B	Q5_K_M	6.5 GB	99.0	89.7	92	Best 8B

Two things stand out. First, structural validity is essentially a solved problem once you use constrained decoding: every model scores 97.8% or higher, and the only one below 99% (Mistral-Small-3.2) loses points to enum-value drift the grammar cannot catch when the enum is large. Second, the field-level accuracy gap between Qwen3-14B and Qwen3-32B is only 1.3 points, while the 14B's throughput is more than double (71 vs. 34 tok/s). That is why we recommend 14B as the default.

Runtime stack comparison

Same model (Qwen3-14B Q5_K_M), four different runtime/decoder combinations, same 700-doc corpus. This is where a lot of teams leave performance on the table.

Stack	Decoder	Valid %	Tok/s	Setup difficulty
llama.cpp b4892	XGrammar	99.6	71	Medium
llama.cpp b4892	GBNF (native)	99.6	58	Medium
vLLM 0.7.3	XGrammar	99.6	104	High
Ollama 0.5.11	format=json + schema	99.4	62	Low
llama.cpp b4892	Prompt only (no grammar)	93.1	78	Low

The vLLM + XGrammar combination is the highest throughput path if you can afford the operational complexity. For most teams, Ollama with its native structured-output API (added in 0.5) is the right tradeoff — you give up ~40% throughput vs. vLLM, but setup is one ollama pull away. See the Ollama structured outputs announcement for the API shape.

How to wire this up in 10 minutes

The fastest path from zero to production-quality JSON extraction on a local box:

1. Install Ollama 0.5+ (curl -fsSL https://ollama.com/install.sh | sh).
2. Pull the model: ollama pull qwen3:14b-instruct-q5_K_M.
3. Define your schema as a Pydantic model.
4. Call Ollama's /api/chat with the format field set to Model.model_json_schema().
5. Validate the response with Model.model_validate_json(response["message"]["content"]).

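from ollama import chat
from pydantic import BaseModel

# Sample input for illustration; replace with the raw text of a real invoice.
INVOICE_TEXT = "ACME GmbH, invoice in EUR: 2 widgets at 4.50 each, total 9.00."

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price_cents: int  # integer cents avoid float rounding in money fields

class Invoice(BaseModel):
    vendor: str
    currency: str
    items: list[LineItem]
    total_cents: int

resp = chat(
    model="qwen3:14b-instruct-q5_K_M",
    messages=[{"role": "user", "content": INVOICE_TEXT}],
    format=Invoice.model_json_schema(),  # JSON Schema handed to the constrained decoder
    options={"temperature": 0},  # deterministic field values
)
# Re-validate on the client; raises pydantic.ValidationError on a mismatch.
invoice = Invoice.model_validate_json(resp["message"]["content"])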

For production workloads with concurrent requests, swap Ollama for vLLM with the --guided-decoding-backend xgrammar flag. The same Pydantic schema works via the OpenAI-compatible response_format field. The full reference implementation, including retry logic for the rare 0.4% of invalid outputs, ships with our open-source quelllm-mcp server, which exposes any local model as an MCP endpoint with built-in schema validation.
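The vLLM path and the retry idea can be sketched roughly as follows. This is not the quelllm-mcp implementation. The request body follows the OpenAI json_schema convention for response_format, which we assume the vLLM OpenAI-compatible server accepts as described above. INVOICE_SCHEMA, build_request, extract_with_retry, and the send callable are hypothetical names introduced for this example.

```python
import json

# Hypothetical schema; in practice this would be Invoice.model_json_schema().
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_cents": {"type": "integer"},
    },
    "required": ["vendor", "total_cents"],
}

def build_request(model: str, text: str, schema: dict) -> dict:
    """Build a chat-completions body that asks for schema-constrained output."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": text}],
        "temperature": 0,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice", "schema": schema},
        },
    }

def extract_with_retry(send, request: dict, validate, max_attempts: int = 3):
    """Call send(request) until validate(raw_text) succeeds or attempts run out.

    send: callable that takes the request dict and returns the raw response text.
    validate: callable that returns the parsed object or raises ValueError
              (json.JSONDecodeError and pydantic.ValidationError both qualify).
    """
    last_error = None
    for _ in range(max_attempts):
        raw = send(request)
        try:
            return validate(raw)
        except ValueError as error:
            last_error = error
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error

# Usage with a stand-in transport: first reply is truncated, second is valid.
_replies = iter(['{"vendor": "ACME"', '{"vendor": "ACME", "total_cents": 900}'])
request = build_request("local-model", "ACME invoice, total 9.00 EUR", INVOICE_SCHEMA)
result = extract_with_retry(lambda req: next(_replies), request, json.loads)
print(result)
```

One design caveat: at temperature 0 an identical request tends to reproduce the same failure, so a real retry loop would probably change something between attempts, for example raising max_tokens.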

Edge cases that still break local models

Constrained decoding is not magic. Three failure modes show up consistently in our data:

Truncation under max_tokens. If the schema requires 12 fields and the model hits the token limit at field 9, generation stops before the grammar reaches the closing braces, and you get unterminated JSON. Always set max_tokens well above the worst-case schema size; as a rule of thumb, budget 4× your average response length.
Date and number formats. The grammar accepts "2026-05-16" and "05/16/2026" equally if your schema says string. Use format: "date" in JSON Schema and validate downstream — XGrammar respects format constraints, GBNF does not.
Large enums. When an enum has more than ~50 values (e.g. country codes, ESCO occupation codes), even Qwen3-32B picks plausible-but-wrong values 2–3% of the time. The grammar guarantees the value is in the enum; it does not guarantee it is the right one. Use embedding-based reranking on top.

Our finding on enums aligns with the failure analysis in the Outlines paper and the more recent Qwen3-14B model card's own evaluation appendix.
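The first two failure modes can be caught with cheap checks after decoding. The sketch below is one way to do it with the standard library. It assumes the server reports a finish reason of "length" when the token limit is hit (the OpenAI-compatible convention), and the helper names are hypothetical.

```python
import json
import re
from datetime import date

ISO_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def max_tokens_budget(average_response_tokens: int, factor: int = 4) -> int:
    """Apply the 4x rule of thumb from the truncation item above."""
    return average_response_tokens * factor

def parse_or_flag_truncation(raw: str, finish_reason: str):
    """Return the parsed JSON, or raise ValueError if the output was cut off."""
    if finish_reason == "length":
        raise ValueError("output hit the token limit; JSON is likely unterminated")
    return json.loads(raw)

def is_iso_date(value: str) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE_PATTERN.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

print(max_tokens_budget(400))
print(is_iso_date("2026-05-16"), is_iso_date("05/16/2026"))
```

The date check runs downstream of the decoder, so it works the same whether or not the grammar backend enforces format constraints.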

Cost: local vs. the API alternative

The reason most teams move extraction in-house is unit economics. A back-of-envelope based on our throughput numbers, assuming a 24/7 pipeline processing 1 million documents/month at ~800 input + 400 output tokens each:

Option	Hardware/API	Monthly cost (USD)	Notes
Qwen3-14B local	1× RTX 4090 (amortized 36 mo)	~$68 + power ($31)	Single stream at 71 tok/s covers ~460,000 docs/month (output tokens only)
Qwen3-14B local (vLLM)	1× RTX 4090	~$99 all-in	4 concurrent streams
GPT-5-mini (API)	—	~$540	At Jan 2026 pricing
Claude Haiku 4.5 (API)	—	~$410	At Jan 2026 pricing

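Run the numbers for your own workload with our cost calculator, or the French-language equivalent at quelllm.fr. Note that, treating the ~$99 local cost as fixed and API cost as linear in volume, the local option breaks even against GPT-5-mini at roughly 180,000 documents/month and against Claude Haiku 4.5, the cheapest API in the table, at roughly 240,000.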
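The break-even arithmetic can be reproduced from the table in a few lines. This sketch assumes the local cost is a fixed ~$99/month regardless of volume and that API cost scales linearly from the quoted 1 million documents/month; both are simplifications.

```python
LOCAL_MONTHLY_USD = 99.0  # ~$68 amortized hardware + ~$31 power, from the table
DOCS_AT_QUOTE = 1_000_000  # volume at which the API prices were quoted
API_MONTHLY_USD = {"GPT-5-mini": 540.0, "Claude Haiku 4.5": 410.0}

def break_even_docs(api_monthly_usd: float,
                    local_monthly_usd: float = LOCAL_MONTHLY_USD,
                    docs_at_quote: int = DOCS_AT_QUOTE) -> float:
    """Monthly document volume at which API spend equals the fixed local cost."""
    return local_monthly_usd * docs_at_quote / api_monthly_usd

for name, cost in API_MONTHLY_USD.items():
    print(f"{name}: ~{break_even_docs(cost):,.0f} docs/month")
# GPT-5-mini: ~183,333 docs/month
# Claude Haiku 4.5: ~241,463 docs/month
```

Below those volumes the API is cheaper on paper; above them the fixed local cost wins, provided a single box can keep up with the load.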

Verdict

Use case	Recommended model	Stack
Default extraction pipeline	Qwen3-14B-Instruct Q5_K_M	Ollama 0.5+ or vLLM + XGrammar
Highest accuracy, no latency constraint	Qwen3-32B-Instruct Q4_K_M	vLLM + XGrammar
8 GB VRAM laptop	Qwen3-4B-Instruct-2507 Q4_K_M	Ollama
CPU-only / edge	Qwen3-4B-Instruct-2507 Q4_K_M	llama.cpp + GBNF
Long-document / RAG-style extraction	Gemma-3-27B-it Q4_K_M	vLLM + XGrammar

Qwen3-14B with XGrammar is the model 80% of teams should be running. It is not the most accurate (32B beats it by 1.3 points) and it is not the fastest (4B doubles its throughput), but it is the model that wins on the diagonal — accuracy, throughput, and VRAM footprint together — for any workload where you would otherwise reach for a hosted API.

Frequently asked questions

Does Ollama really enforce JSON Schema, or just JSON syntax?

As of Ollama 0.5, the format field accepts a full JSON Schema object (not just the string "json") and enforces it via llama.cpp's grammar backend. You get the same structural guarantees as direct llama.cpp + GBNF.

Can I use a smaller quantization like Q3_K_M to fit on 8 GB?

Q3_K_M of Qwen3-14B fits in ~7.2 GB but field-level accuracy drops by 3.8 points in our tests. You are better off using Qwen3-4B at Q4_K_M, which is more accurate and 2× faster.

What about Gemma-3-27B for extraction?

Gemma-3-27B is excellent — within 0.7 points of Qwen3-32B on field accuracy and noticeably better on long PDF-derived text. The drawback is throughput: 41 tok/s vs. Qwen3-14B's 71. Pick it if your documents routinely exceed 8K tokens.

Why not Llama-3.3-70B?

It is excellent on accuracy (95.9%) but requires two 24 GB GPUs and outputs only 14 tok/s. The cost-per-document is roughly 5× a Qwen3-14B setup with negligible quality gain.

Does temperature matter for JSON extraction?

Yes. Set temperature=0 for deterministic field-level outputs. Higher temperatures with constrained decoding still produce valid JSON but introduce field-value variance you do not want in an extraction pipeline.

Can I run this on Apple Silicon?

Yes. llama.cpp's Metal backend supports XGrammar and GBNF. An M3 Max 64 GB runs Qwen3-14B Q5_K_M at ~38 tok/s — slower than a 4090 but production-viable for batch pipelines.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.