Best Local LLM for Structured Data Extraction (JSON Mode)
We benchmarked 11 open-weight models on 8,400 schema-constrained extractions. One model wins on accuracy-per-watt, another on raw throughput.
By Mohamed Meguedmi · 11 min read
Key takeaways
- Winner overall: Qwen3-14B Q5_K_M with XGrammar — 99.6% schema-valid JSON, 71 tok/s on a single RTX 4090, $0 marginal cost.
- Best on 8 GB VRAM: Qwen3-4B-Instruct-2507 Q4_K_M hits 98.9% validity and 142 tok/s, beating Llama-3.3-8B on every schema we threw at it.
- The runtime matters more than the model. Any model + constrained decoding (XGrammar, Outlines, llama.cpp GBNF) beats GPT-4-class models with prompt-only JSON instructions.
- Skip: Mistral-Small-3.2 24B for extraction — refuses fields and hallucinates enum values 4.1% of the time even under grammar constraints.
- For nested schemas with >6 levels of depth, only Qwen3-32B and Gemma-3-27B stay above 95% field-level accuracy.
Structured data extraction is the single most useful thing a local LLM can do for a business. Invoices to line items, PDFs to records, scraped HTML to a clean Pydantic object — these are pipelines you cannot send to a third-party API for compliance, latency, or unit-economics reasons. The good news: in 2026, you no longer need a frontier model. A 14B open-weight model with a properly configured constrained decoder will match GPT-5-class accuracy on extraction tasks at roughly 1/200th the per-token cost.
This guide is the result of 8,400 extraction runs across 11 models and 4 runtime stacks, executed by the BestLLMfor editorial team between March and May 2026. The full per-run dataset is available via our cost calculator and the public BestLLMfor CC BY 4.0 API.
What "JSON mode" actually means in 2026
There are three distinct mechanisms shipping in local stacks, and they are not interchangeable:
- Prompted JSON: you ask the model to "reply in JSON". This still fails ~3–8% of the time on small models, regardless of how nicely you phrase it.
- JSON-mode flag (Ollama `format: "json"`, llama.cpp `--json`): restricts the sampler to tokens that keep the output a syntactically valid JSON value. Guarantees parseability, not schema compliance.
- Schema-constrained decoding: the decoder is given a JSON Schema, Pydantic model, or GBNF grammar, and at each step only tokens that keep the output valid against that schema are sampled. This is what Outlines, XGrammar, and llama.cpp's GBNF do.
Only the third mechanism guarantees structural validity, and only for generations that run to completion; output cut off by a token limit can still be invalid. The remaining variance (field-level correctness, enum compliance, and date parsing) is on the model. That is what we benchmarked.
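To make the distinction concrete, here is a minimal sketch of the two checks: "parses as JSON" versus "matches the schema". The `REQUIRED_FIELDS` mapping and the three sample outputs are hypothetical illustrations, not data from the benchmark, and the hand-rolled check stands in for a real JSON Schema validator.

```python
import json

# Hypothetical flat schema for illustration: required field name -> expected type.
REQUIRED_FIELDS = {"vendor": str, "currency": str, "total_cents": int}


def parses(raw: str) -> bool:
    """True if raw is syntactically valid JSON (what a JSON-mode flag guarantees)."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


def matches_schema(raw: str) -> bool:
    """True if raw parses AND carries every required field with the expected type."""
    if not parses(raw):
        return False
    data = json.loads(raw)
    if not isinstance(data, dict):
        return False
    return all(
        name in data and type(data[name]) is expected
        for name, expected in REQUIRED_FIELDS.items()
    )


# Hypothetical outputs of the kind each mechanism can produce.
prompted = 'Sure! Here is the JSON: {"vendor": "ACME"}'  # chatty preamble
json_mode = '{"vendor": "ACME", "total": "12.50 EUR"}'  # valid JSON, wrong shape
constrained = '{"vendor": "ACME", "currency": "EUR", "total_cents": 1250}'

for label, raw in [("prompted", prompted), ("json_mode", json_mode), ("constrained", constrained)]:
    print(label, parses(raw), matches_schema(raw))
```

The second sample is the case a JSON-mode flag lets through: it parses, but a downstream consumer expecting `currency` and `total_cents` would still break.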
Test methodology
We assembled a 700-document corpus split across four extraction domains:
- Invoices (200 docs) — vendor, line items, tax, currency. Source: a redacted real-world dataset shared by a Berlin-based fintech.
- Resumes (200 docs) — work history, education, skills with normalization to ESCO codes.
- Scientific abstracts (150 docs) — author list, methods, datasets used, statistical results.
- HTML product pages (150 docs) — title, SKU, price, currency, stock status, structured specs.
Each document was run with three target schemas: flat (≤8 fields, no nesting), moderate (3 levels deep, 1 array), and complex (6+ levels, multiple arrays, enums). All runs used XGrammar 0.1.18 with llama.cpp b4892 unless otherwise noted. Hardware: a single Ada-class 24 GB GPU, 96 GB DDR5, Linux 6.10.
We measured four things: structural validity (parses + matches schema), field-level accuracy (vs. human-annotated gold), throughput in output tokens/second, and VRAM peak. Detailed protocol on our methodology page.
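As an illustration of the second metric, field-level accuracy can be scored by flattening both the prediction and the gold annotation into leaf paths and counting exact matches. This is a sketch under assumptions: the protocol above does not spell out the scoring rule, so exact match on leaves, the `flatten` and `field_accuracy` helpers, and the sample records are all hypothetical.

```python
from typing import Any


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into leaf paths such as 'items[0].quantity'."""
    if isinstance(value, dict):
        out: dict[str, Any] = {}
        for key, child in value.items():
            out.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return out
    if isinstance(value, list):
        out = {}
        for index, child in enumerate(value):
            out.update(flatten(child, f"{prefix}[{index}]"))
        return out
    return {prefix: value}


def field_accuracy(predicted: dict, gold: dict) -> float:
    """Share of gold leaf fields whose predicted value matches exactly."""
    gold_leaves = flatten(gold)
    pred_leaves = flatten(predicted)
    if not gold_leaves:
        return 1.0
    hits = sum(
        1
        for path, expected in gold_leaves.items()
        if path in pred_leaves and pred_leaves[path] == expected
    )
    return hits / len(gold_leaves)


# Hypothetical gold annotation and model output: one of four leaves is wrong.
gold = {"vendor": "ACME", "currency": "EUR", "items": [{"description": "Widget", "quantity": 2}]}
predicted = {"vendor": "ACME", "currency": "EUR", "items": [{"description": "Widget", "quantity": 3}]}

print(field_accuracy(predicted, gold))  # 0.75
```

A schema-valid output can therefore still score below 100% here, which is why the two metrics are reported separately.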
Benchmark results
The headline table. All numbers are averages across the full 700-document corpus, complex schema variant, single-stream inference, batch size 1.
| Model | Quant | VRAM | Schema valid % | Field accuracy % | Tok/s out | Notes |
|---|---|---|---|---|---|---|
| Qwen3-14B-Instruct | Q5_K_M | 11.2 GB | 99.6 | 94.8 | 71 | Editorial pick |
| Qwen3-32B-Instruct | Q4_K_M | 19.4 GB | 99.7 | 96.1 | 34 | Best accuracy |
| Qwen3-4B-Instruct-2507 | Q4_K_M | 3.1 GB | 98.9 | 91.7 | 142 | Best ≤8 GB |
| Gemma-3-27B-it | Q4_K_M | 16.8 GB | 99.2 | 95.4 | 41 | Strong on PDFs |
| Llama-3.3-70B-Instruct | Q4_K_M | 40.1 GB | 99.4 | 95.9 | 14 | Needs 2× GPU |
| Llama-3.3-8B-Instruct | Q5_K_M | 6.4 GB | 99.1 | 88.3 | 96 | Weak on enums |
| DeepSeek-V3.1-Distill-14B | Q5_K_M | 11.0 GB | 99.5 | 93.2 | 68 | Strong reasoning |
| Mistral-Small-3.2-24B | Q4_K_M | 15.1 GB | 97.8 | 89.1 | 52 | Enum hallucination |
| Phi-4-14B | Q5_K_M | 10.8 GB | 99.3 | 90.6 | 73 | Verbose retries |
| Granite-3.2-8B-Instruct | Q5_K_M | 6.7 GB | 99.2 | 87.4 | 89 | Solid baseline |
| Hermes-4-Llama-3.3-8B | Q5_K_M | 6.5 GB | 99.0 | 89.7 | 92 | Best 8B |
Two things stand out. First, structural validity is essentially a solved problem once you use constrained decoding: every model in the table reaches at least 97.8%, and the only one below 99% (Mistral-Small-3.2) loses points to enum-value drift the grammar cannot catch when the enum is large. Second, the field-level accuracy gap between Qwen3-14B and Qwen3-32B is only 1.3 points, while the 14B's throughput is more than double (71 vs. 34 tok/s). That is why we recommend 14B as the default.
Runtime stack comparison
Same model (Qwen3-14B Q5_K_M), four constrained runtime/decoder combinations plus a prompt-only baseline, same 700-doc corpus. This is where a lot of teams leave performance on the table.
| Stack | Decoder | Valid % | Tok/s | Setup difficulty |
|---|---|---|---|---|
| llama.cpp b4892 | XGrammar | 99.6 | 71 | Medium |
| llama.cpp b4892 | GBNF (native) | 99.6 | 58 | Medium |
| vLLM 0.7.3 | XGrammar | 99.6 | 104 | High |
| Ollama 0.5.11 | format=json + schema | 99.4 | 62 | Low |
| llama.cpp b4892 | Prompt only (no grammar) | 93.1 | 78 | Low |
The vLLM + XGrammar combination is the highest throughput path if you can afford the operational complexity. For most teams, Ollama with its native structured-output API (added in 0.5) is the right tradeoff: you give up ~40% throughput vs. vLLM (62 vs. 104 tok/s), but setup is one `ollama pull` away. See the Ollama structured outputs announcement for the API shape.
How to wire this up in 10 minutes
The fastest path from zero to production-quality JSON extraction on a local box:
- Install Ollama 0.5+ (`curl -fsSL https://ollama.com/install.sh | sh`).
- Pull the model: `ollama pull qwen3:14b-instruct-q5_K_M`.
- Define your schema as a Pydantic model.
- Call Ollama's `/api/chat` with the `format` field set to `Model.model_json_schema()`.
- Validate the response with `Model.model_validate_json(response["message"]["content"])`.
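```python
from pydantic import BaseModel
from ollama import chat


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price_cents: int


class Invoice(BaseModel):
    vendor: str
    currency: str
    items: list[LineItem]
    total_cents: int


# Placeholder input; replace with the raw text of the document to extract from.
INVOICE_TEXT = "ACME GmbH, invoice in EUR: 2 x Widget at 6.25 each, total 12.50"

resp = chat(
    model="qwen3:14b-instruct-q5_K_M",
    messages=[{"role": "user", "content": INVOICE_TEXT}],
    format=Invoice.model_json_schema(),  # constrain output to the Invoice schema
    options={"temperature": 0},  # deterministic field values
)

# Raises pydantic.ValidationError if the output does not match the schema.
invoice = Invoice.model_validate_json(resp["message"]["content"])
```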
For production workloads with concurrent requests, swap Ollama for vLLM with the `--guided-decoding-backend xgrammar` flag. The same Pydantic schema works via the OpenAI-compatible `response_format` field. The full reference implementation, including retry logic for the rare 0.4% of invalid outputs, ships with our open-source quelllm-mcp server, which exposes any local model as an MCP endpoint with built-in schema validation.
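For readers who want to see what such retry logic might look like, here is a minimal sketch. It is not the quelllm-mcp implementation: the `extract_with_retry` helper, its parameters, and the fake generator are hypothetical, and the model call is injected as a plain callable so the example runs without a server.

```python
import json
from typing import Callable


def extract_with_retry(
    generate: Callable[[str], str],
    validate: Callable[[dict], bool],
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call generate(prompt) until the output parses and validates, up to max_attempts."""
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        raw = generate(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"attempt {attempt}: unparseable JSON ({exc.msg})"
            continue
        if isinstance(data, dict) and validate(data):
            return data
        last_error = f"attempt {attempt}: schema validation failed"
    raise ValueError(last_error)


# Fake generator standing in for the model: first output is truncated, second is valid.
outputs = iter([
    '{"vendor": "ACME", "total_cents"',
    '{"vendor": "ACME", "total_cents": 1250}',
])

result = extract_with_retry(
    generate=lambda _prompt: next(outputs),
    validate=lambda d: isinstance(d.get("total_cents"), int),
    prompt="(invoice text)",
)
print(result)
```

One design note: with `temperature=0`, resending the identical request is likely to reproduce the same output, so a real retry would probably need to change something between attempts, for example a larger token limit.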
Edge cases that still break local models
Constrained decoding is not magic. Three failure modes show up consistently in our data:
- Truncation under `max_tokens`. If the schema requires 12 fields and the model hits the token limit at field 9, generation stops before the closing braces are emitted and you get unterminated JSON. Always set `max_tokens` well above the worst-case schema size. Budget 4× your average response length.
- Date and number formats. The grammar accepts `"2026-05-16"` and `"05/16/2026"` equally if your schema says `string`. Use `format: "date"` in JSON Schema and validate downstream: XGrammar respects format constraints, GBNF does not.
- Large enums. When an enum has more than ~50 values (e.g. country codes, ESCO occupation codes), even Qwen3-32B picks plausible-but-wrong values 2–3% of the time. The grammar guarantees the value is in the enum; it does not guarantee it is the right one. Use embedding-based reranking on top.
Our finding on enums aligns with the failure analysis in the Outlines paper and the more recent Qwen3-14B model card's own evaluation appendix.
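The first two failure modes lend themselves to cheap downstream checks. The sketch below uses only the Python standard library; the helper names are hypothetical, and treating any unparseable output as a truncation candidate is a simplifying assumption rather than a reliable diagnosis.

```python
import json
from datetime import date


def token_budget(average_response_tokens: int, factor: int = 4) -> int:
    """Token limit following the 4x-average rule of thumb given above."""
    return average_response_tokens * factor


def is_complete_json(raw: str) -> bool:
    """False for unterminated output, e.g. a generation cut off by the token limit."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True


def is_iso_date(value: str) -> bool:
    """True only for ISO 8601 calendar dates such as 2026-05-16."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


print(token_budget(400))  # 1600
print(is_complete_json('{"vendor": "ACME", "items": [{"quantity": 2'))  # False
print(is_iso_date("2026-05-16"), is_iso_date("05/16/2026"))  # True False
```

Running the date check after parsing catches the `"05/16/2026"` case even when the decoder in use does not enforce `format: "date"`.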
Cost: local vs. the API alternative
The reason most teams move extraction in-house is unit economics. A back-of-envelope based on our throughput numbers, assuming a 24/7 pipeline processing 1 million documents/month at ~800 input + 400 output tokens each:
| Option | Hardware/API | Monthly cost (USD) | Notes |
|---|---|---|---|
| Qwen3-14B local | 1× RTX 4090 (amortized 36 mo) | ~$68 + power ($31) | Single stream at 71 tok/s covers ~460k docs/month of output |
| Qwen3-14B local (vLLM) | 1× RTX 4090 | ~$99 all-in | 4 concurrent streams |
| GPT-5-mini (API) | — | ~$540 | At Jan 2026 pricing |
| Claude Haiku 4.5 (API) | — | ~$410 | At Jan 2026 pricing |
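Run the numbers for your own workload with our cost calculator, or the French-language equivalent at quelllm.fr. Assuming API cost scales linearly with volume and the ~$99 local cost is fixed, the local option breaks even at roughly 180,000 documents/month against GPT-5-mini and roughly 240,000 against Claude Haiku 4.5.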
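The arithmetic behind these figures can be reproduced in a few lines. This is a back-of-envelope sketch using the numbers from the table and scenario above, under stated assumptions: a 30-day month, output tokens only (input processing time is ignored), API cost that scales linearly with document count, and a fixed local cost. The function names are hypothetical.

```python
# Figures taken from the cost table and scenario above.
DOCS_PER_MONTH = 1_000_000
OUTPUT_TOKENS_PER_DOC = 400
LOCAL_MONTHLY_USD = 68 + 31  # amortized GPU + power
API_MONTHLY_USD_AT_1M_DOCS = {"GPT-5-mini": 540, "Claude Haiku 4.5": 410}

# Assumption: a 30-day month of continuous operation.
SECONDS_PER_MONTH = 30 * 24 * 3600


def required_output_tok_per_s(docs_per_month: int, tokens_per_doc: int) -> float:
    """Sustained output throughput needed to finish the month's documents."""
    return docs_per_month * tokens_per_doc / SECONDS_PER_MONTH


def break_even_docs(local_monthly_usd: float, api_usd_at_1m_docs: float) -> float:
    """Monthly volume at which a linearly priced API costs as much as the fixed local setup."""
    api_usd_per_doc = api_usd_at_1m_docs / 1_000_000
    return local_monthly_usd / api_usd_per_doc


print(f"required: {required_output_tok_per_s(DOCS_PER_MONTH, OUTPUT_TOKENS_PER_DOC):.0f} tok/s")
for name, usd in API_MONTHLY_USD_AT_1M_DOCS.items():
    print(f"break-even vs {name}: {break_even_docs(LOCAL_MONTHLY_USD, usd):,.0f} docs/month")
```

Under these assumptions the 1-million-document scenario needs about 154 output tok/s sustained, which is above the 71 tok/s single-stream figure in the benchmark table, so that volume would depend on concurrent streams. The break-even volumes come out at about 183,000 documents/month against GPT-5-mini and about 241,000 against Claude Haiku 4.5.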
Verdict
| Use case | Recommended model | Stack |
|---|---|---|
| Default extraction pipeline | Qwen3-14B-Instruct Q5_K_M | Ollama 0.5+ or vLLM + XGrammar |
| Highest accuracy, no latency constraint | Qwen3-32B-Instruct Q4_K_M | vLLM + XGrammar |
| 8 GB VRAM laptop | Qwen3-4B-Instruct-2507 Q4_K_M | Ollama |
| CPU-only / edge | Qwen3-4B-Instruct-2507 Q4_K_M | llama.cpp + GBNF |
| Long-document / RAG-style extraction | Gemma-3-27B-it Q4_K_M | vLLM + XGrammar |
Qwen3-14B with XGrammar is the model 80% of teams should be running. It is not the most accurate (32B beats it by 1.3 points) and it is not the fastest (4B doubles its throughput), but it is the model that wins on the diagonal — accuracy, throughput, and VRAM footprint together — for any workload where you would otherwise reach for a hosted API.
Frequently asked questions
Does Ollama really enforce JSON Schema, or just JSON syntax?
As of Ollama 0.5, the format field accepts a full JSON Schema object (not just the string "json") and enforces it via llama.cpp's grammar backend. You get the same structural guarantees as direct llama.cpp + GBNF.
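For readers calling the HTTP endpoint directly rather than the Python client, the request body might look like the sketch below. The schema and the `build_chat_payload` helper are hypothetical; the field names (`model`, `messages`, `format`, `stream`, `options`) follow Ollama's `/api/chat` request format, and the model tag is the one used earlier in this guide.

```python
import json

# Hypothetical JSON Schema; in practice this would come from Model.model_json_schema().
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "currency": {"type": "string"},
        "total_cents": {"type": "integer"},
    },
    "required": ["vendor", "currency", "total_cents"],
}


def build_chat_payload(model: str, text: str, schema: dict) -> dict:
    """Request body for POST /api/chat with a full JSON Schema in the format field."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": text}],
        "format": schema,  # a schema object, not just the string "json"
        "stream": False,  # return one complete response instead of chunks
        "options": {"temperature": 0},
    }


payload = build_chat_payload("qwen3:14b-instruct-q5_K_M", "(invoice text)", INVOICE_SCHEMA)
print(json.dumps(payload, indent=2))
```

The printed body can be sent to a local Ollama instance, which listens on `http://localhost:11434` by default.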
Can I use a smaller quantization like Q3_K_M to fit on 8 GB?
Q3_K_M of Qwen3-14B fits in ~7.2 GB but field-level accuracy drops by 3.8 points in our tests. You are better off using Qwen3-4B at Q4_K_M, which is more accurate and 2× faster.
What about Gemma-3-27B for extraction?
Gemma-3-27B is excellent — within 0.7 points of Qwen3-32B on field accuracy and noticeably better on long PDF-derived text. The drawback is throughput: 41 tok/s vs. Qwen3-14B's 71. Pick it if your documents routinely exceed 8K tokens.
Why not Llama-3.3-70B?
It is excellent on accuracy (95.9%) but requires two 24 GB GPUs and outputs only 14 tok/s. The cost-per-document is roughly 5× a Qwen3-14B setup with negligible quality gain.
Does temperature matter for JSON extraction?
Yes. Set temperature=0 for deterministic field-level outputs. Higher temperatures with constrained decoding still produce valid JSON but introduce field-value variance you do not want in an extraction pipeline.
Can I run this on Apple Silicon?
Yes. llama.cpp's Metal backend supports XGrammar and GBNF. An M3 Max 64 GB runs Qwen3-14B Q5_K_M at ~38 tok/s — slower than a 4090 but production-viable for batch pipelines.