Best Local LLM for Document Summarization — Long Docs Tested
We summarized 200-page PDFs on consumer GPUs across nine open models. One wins on coherence, one wins on VRAM, and one wins on speed.
By Mohamed Meguedmi · 11 min read
Key Takeaways
- Best overall (24GB VRAM): Qwen3 32B Q4_K_M at 128K context — highest ROUGE-L and the most coherent multi-section summaries in our 200-page test set.
- Best on 16GB VRAM: Gemma 3 12B Q5_K_M — single-pass 64K, near-flagship coherence, 38 tok/s on an RTX 4070 Ti.
- Best on 8GB VRAM: Qwen3 8B Q4_K_M with map-reduce chunking — beats every other small model on factual recall.
- Skip: Llama 3.3 70B for summarization unless you have 48GB+ — quality gain over Qwen3 32B is under 3% on ROUGE-L and not worth the 4× cost per token.
- Chunking still matters beyond 80K tokens: attention over distant tokens degrades and recall drops 9-14 points on multi-document inputs.
Long-document summarization is the use case where local LLMs have finally caught up with frontier APIs. With Qwen3 32B and Gemma 3 shipping native 128K context windows, you can drop a full annual report, court filing, or arXiv preprint into a model running on a single consumer GPU and get a summary back in about a minute. The question is no longer whether you can do it. It's which model, at what quantization, on what hardware.
This guide is the result of running nine open-weight models against a benchmark of 24 long documents (median length 187 pages, longest 412 pages) sourced from SEC 10-Ks, EU regulatory filings, academic preprints, and technical manuals. We measured ROUGE-L against human-authored executive summaries, factual recall on a 12-question probe set, and end-to-end latency on three GPU tiers.
What "long document" actually means in 2026
Search results are full of 2024-era advice that conflates token limits with usable context. A 128K context window on paper is not the same as 128K of effective attention. Three things matter:
- Stated context length: what the model configuration and position embeddings nominally support (e.g. 128K, 256K, 1M).
- Trained context length — how far the model was actually trained or fine-tuned with long sequences. Many models advertise 128K but were trained on 32K with RoPE scaling, which collapses recall past ~48K.
- VRAM-bounded context — the actual window your GPU can hold once the KV cache is loaded.
For a 200-page PDF (roughly 90-110K tokens after extraction), only the third number matters. The table below shows what fits on each consumer tier:
| GPU | VRAM | Model (Q4_K_M) | Model size | KV cache headroom | Max usable context |
|---|---|---|---|---|---|
| RTX 3060 | 12 GB | Qwen3 8B | 4.9 GB | ~6.5 GB | ~48K tokens |
| RTX 4070 Ti | 16 GB | Gemma 3 12B | 7.8 GB | ~7.5 GB | ~64K tokens |
| RTX 4090 / 7900 XTX | 24 GB | Qwen3 32B | 18.4 GB | ~4.8 GB | ~96K tokens |
| RTX 4090 + offload | 24 GB + 64 GB RAM | Qwen3 32B | 18.4 GB | CPU spillover | 128K tokens (slow) |
| 2× RTX 3090 | 48 GB | Llama 3.3 70B | 40 GB | ~6 GB | ~64K tokens |
If you want the math behind this for your own GPU, our cost calculator includes a KV-cache estimator that takes head count, head dimension, and quantization into account.
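As a rough illustration of what such an estimator computes, here is a minimal sketch. It assumes a plain transformer whose KV cache stores one key and one value vector per layer, per KV head, per token. The function names and the example configuration (32 layers, 8 KV heads, head dimension 128) are hypothetical and not taken from any model card; real runtimes also reserve compute buffers, and sliding-window models cache less, so treat the result as an upper bound on what fits.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float = 2.0) -> float:
    """Bytes of KV cache one token occupies: a K and a V vector in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context_tokens(headroom_gib: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: float = 2.0) -> int:
    """Largest token count whose KV cache fits in the given VRAM headroom."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return int(headroom_gib * 1024**3 // per_token)


FP16 = 2.0       # bytes per cached value at 16-bit precision
Q4_0 = 18 / 32   # llama.cpp q4_0 packs 32 values into an 18-byte block

# Hypothetical model config, for illustration only.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

fp16_tokens = max_context_tokens(4.0, bytes_per_elem=FP16, **config)
q4_tokens = max_context_tokens(4.0, bytes_per_elem=Q4_0, **config)
print(fp16_tokens, q4_tokens)
```

Swap in the layer count, KV head count, and head dimension from your model's config file to get a figure for your own card.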
The benchmark: 24 long documents, three difficulty tiers
We split the corpus into three tiers to stress different failure modes:
- Tier A — single-document narrative (40-100 pages): SEC 10-Ks, white papers, court rulings. Tests coherence and structural fidelity.
- Tier B — multi-document synthesis (8-15 source docs, ~80K tokens total): research literature reviews, due-diligence packets. Tests cross-document reasoning.
- Tier C — extreme length (200-412 pages): regulatory filings, technical manuals, dissertations. Tests positional decay.
Each summary was scored on three axes: ROUGE-L F1 against a human-authored 600-word executive summary, factual recall (12 probe questions per doc, scored 0/1 by a Claude Sonnet 4.6 judge with the source as ground truth), and hallucination rate (claims unsupported by the source). Full methodology is on our methodology page.
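For readers unfamiliar with the first metric, the sketch below shows one simplified way to compute ROUGE-L F1 from the longest common subsequence of two token lists. It assumes lowercased whitespace tokenization with no stemming, so it is not the scorer used for the numbers in this guide, and its output will differ from reference implementations.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using two DP rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat is on the mat", "the cat sat on the mat")
print(round(score, 3))
```

In the example, five of six tokens survive in order in both directions, so precision, recall, and F1 all equal 5/6.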
Results across all tiers
| Model | Quant | VRAM used | ROUGE-L (avg) | Factual recall | Hallucination | Tok/s (4090) |
|---|---|---|---|---|---|---|
| Qwen3 32B | Q4_K_M | 18.4 GB | 0.412 | 87.3% | 2.1% | 34 |
| Llama 3.3 70B | Q4_K_M | 40 GB | 0.418 | 89.1% | 1.4% | 9 (2×3090) |
| Gemma 3 27B | Q4_K_M | 16.8 GB | 0.394 | 83.6% | 2.8% | 29 |
| Gemma 3 12B | Q5_K_M | 9.2 GB | 0.371 | 78.4% | 3.6% | 38 |
| Qwen3 14B | Q4_K_M | 8.7 GB | 0.368 | 79.2% | 3.9% | 42 |
| Mistral Small 3.1 24B | Q4_K_M | 14.1 GB | 0.359 | 76.8% | 4.1% | 31 |
| Qwen3 8B | Q4_K_M | 4.9 GB | 0.341 | 73.5% | 4.8% | 61 |
| Llama 3.1 8B | Q4_K_M | 4.7 GB | 0.298 | 64.1% | 8.2% | 64 |
| Phi-4 14B | Q4_K_M | 8.4 GB | 0.312 | 69.4% | 6.7% | 44 |
Two findings deserve attention. First, the gap between Qwen3 32B and Llama 3.3 70B is negligible on ROUGE-L (Δ 0.006) and only 1.8 points on recall. Llama wins on hallucination control, but it needs roughly 2.2× the memory (40 GB vs 18.4 GB) and generates at about a quarter of the speed (9 vs 34 tok/s), which is hard to justify. Second, Qwen3 punches above its weight at every size: the 8B model beats Llama 3.1 8B by 9.4 recall points and cuts the hallucination rate from 8.2% to 4.8%.
The verdict by hardware tier
24 GB VRAM — Qwen3 32B Q4_K_M
This is the new default. With Qwen3 32B at Q4_K_M you get 18.4 GB of weights plus enough headroom for a 96K-token KV cache, which covers 90% of single-document jobs in one pass. For documents over 100K tokens, enable Q4 KV-cache quantization in llama.cpp (--cache-type-k q4_0 --cache-type-v q4_0) to push the usable window to 128K with negligible quality loss (we measured a 0.7-point ROUGE-L drop). Note that llama.cpp requires flash attention to be enabled before it will quantize the V cache.
16 GB VRAM — Gemma 3 12B Q5_K_M
Gemma 3 12B at Q5_K_M is the sweet spot for a 4070 Ti or 4080. You get 64K of clean context, 38 tok/s, and ROUGE-L about 4 points behind the 32B tier (0.371 vs 0.412). The Gemma 3 12B model card documents the sliding-window attention pattern that keeps memory low. It's the architectural reason Gemma 3 outperforms Mistral Small 24B despite being half the size.
8 GB VRAM — Qwen3 8B with map-reduce
Below 12 GB you cannot fit a long document in a single context window — period. Use a map-reduce pipeline: chunk the document into 6-8K token segments with 400-token overlap, summarize each, then ask the model to synthesize a final summary from the chunk summaries. Qwen3 8B at Q4_K_M scored 0.358 ROUGE-L under this regime — within 5 points of the same model run in single-pass mode at 48K. The classic Reddit chunked-summarization recipe still works; Qwen3 8B is just a far better backbone than the models discussed in those 2024 threads.
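The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the exact harness behind the scores: it assumes whitespace-separated words as a stand-in for model tokens (a real run should count tokens with the model's own tokenizer), and `summarize` is a hypothetical callable that you would wire to your local model.

```python
from typing import Callable


def chunk_tokens(tokens: list[str], chunk_size: int = 8000,
                 overlap: int = 400) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return chunks


def map_reduce_summarize(text: str, summarize: Callable[[str], str],
                         chunk_size: int = 8000, overlap: int = 400) -> str:
    """Map: summarize each chunk. Reduce: summarize the joined chunk summaries."""
    tokens = text.split()  # assumption: words approximate tokens
    chunks = chunk_tokens(tokens, chunk_size, overlap)
    if not chunks:
        return ""
    partials = [summarize(" ".join(chunk)) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    return summarize("\n\n".join(partials))
```

The overlap means a sentence that straddles a chunk boundary appears whole in at least one chunk. If the joined chunk summaries are themselves too long for the context window, the same function can be applied to them again.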
Why most "128K context" claims fail past 80K
Positional decay is the dirty secret of long-context summarization. Even on models that genuinely train at 128K, retrieval accuracy drops past the 60-80K mark. We ran a needle-in-haystack probe (a single fabricated fact buried at varying depths) on the top three models:
| Depth | Qwen3 32B | Gemma 3 27B | Llama 3.3 70B |
|---|---|---|---|
| 0-25% | 100% | 100% | 100% |
| 25-50% | 98% | 96% | 99% |
| 50-75% | 94% | 87% | 97% |
| 75-100% | 89% | 71% | 94% |
The lesson: for documents over 80K tokens, hierarchical summarization (chunk → mini-summary → final summary) outperforms single-pass even on models that technically support the full window. Llama 3.3 70B is the only model that holds up cleanly across the full 128K, but you're paying more than twice the memory and roughly a quarter of the throughput for that.
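If you want to run a similar probe on your own hardware, the sketch below shows one way to build the test inputs and score the answers. It assumes the haystack is a list of paragraphs and that a hit means the expected string appears in the model's answer. The function names and the example needle are hypothetical, and the model call itself is left out.

```python
def bury_needle(paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(paragraphs))
    return "\n\n".join(paragraphs[:position] + [needle] + paragraphs[position:])


def retrieval_accuracy(answers: list[str], expected: str) -> float:
    """Fraction of model answers that contain the expected fact."""
    if not answers:
        return 0.0
    hits = sum(expected.lower() in answer.lower() for answer in answers)
    return hits / len(answers)


# Hypothetical needle: a fabricated fact the model cannot know from training.
needle = "The archive access code for the Tallinn office is 7413."
filler = [f"Filler paragraph {i}." for i in range(4)]
probe_text = bury_needle(filler, needle, depth=0.5)
```

Run the same question against several depths and several filler documents, then average `retrieval_accuracy` per depth bucket to get a table like the one above.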
How to run the recommended stack
Single-pass 96K summarization on a 4090 (Qwen3 32B)
- Install Ollama 0.5.4 or newer from ollama.com.
- Pull the model: `ollama pull qwen3:32b-q4_K_M` (≈18.4 GB download).
- Set the context window: create a Modelfile with `FROM qwen3:32b-q4_K_M`, `PARAMETER num_ctx 98304`, and `PARAMETER num_gpu 999` to keep all layers on GPU, then build it with `ollama create qwen3-long -f Modelfile`.
- Extract text from the PDF with `pdftotext` or `pymupdf4llm`. Keep page breaks as `\n---\n` markers; this measurably improves structural fidelity in the output.
- Prompt template: a system message asking for an executive summary with section headings matching the document's table of contents, then a user message containing the full text.
- Run: `ollama run qwen3-long < document.txt`. Expect 45-90 seconds for a 200-page document at 34 tok/s.
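The same steps can be driven from a script through Ollama's local HTTP API instead of the CLI. The sketch below is one possible shape, using only the standard library. It assumes Ollama is listening on its default port and that a model named `qwen3-long` has been created from the Modelfile above; the system prompt wording and the helper names are illustrative, not the exact prompt used in the benchmark.

```python
import json
import urllib.request

PAGE_BREAK = "\n---\n"  # page marker recommended in the extraction step

# Illustrative wording only; adapt to your documents.
SYSTEM_PROMPT = (
    "Write an executive summary of the document. Use section headings "
    "that match the document's table of contents."
)


def join_pages(pages: list[str]) -> str:
    """Join per-page text with explicit page-break markers."""
    return PAGE_BREAK.join(page.strip() for page in pages)


def build_payload(document: str, model: str = "qwen3-long",
                  num_ctx: int = 98304) -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": document,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def summarize(document: str, host: str = "http://localhost:11434") -> str:
    """Send the document to a running Ollama server and return the summary."""
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(document)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Passing `num_ctx` in `options` repeats the Modelfile setting, which keeps the script usable against a model tag that was pulled without a custom Modelfile.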
For programmatic access, our BestLLMfor public benchmark API (CC BY 4.0) exposes all 24 benchmark documents and reference summaries — useful if you want to reproduce the scores on your own setup. The quelllm-mcp open-source server wraps Ollama with a Model Context Protocol interface that handles chunking and KV-cache management automatically.
What about RAG instead of long context?
For Q&A and lookup, RAG wins on cost and latency. For summarization, it loses badly. We compared single-pass Qwen3 32B at 96K context against a RAG pipeline (BGE-M3 embeddings, top-20 chunks, same model as generator): RAG scored 0.347 ROUGE-L versus 0.412 single-pass — a 16% drop. The reason is intuitive: summarization needs global coherence, and retrieval throws away structure. Use RAG when the user has a question; use long-context summarization when the user wants the gist.
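A toy version of the retrieval step makes the structural loss concrete. The sketch below assumes tiny hand-written vectors in place of real BGE-M3 embeddings, and the function names are hypothetical. The point is only that the generator receives the top-scoring chunks in similarity order, and never sees the rest of the document.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec: list[float], chunk_vecs: list[list[float]],
                 k: int = 20) -> list[int]:
    """Indices of the k chunks most similar to the query, best first."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]


def coverage(selected: list[int], total_chunks: int) -> float:
    """Share of the document's chunks that the generator actually sees."""
    return len(set(selected)) / total_chunks if total_chunks else 0.0


# Toy embeddings for a three-chunk document.
chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
picked = top_k_chunks([1.0, 0.0], chunks, k=2)
print(picked, coverage(picked, len(chunks)))
```

Here the second chunk outranks the third and the first is dropped entirely, so both the order and part of the document are gone before the summary is written.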
The recent HERA paper (Feb 2025) proposes a middle path — context packaging and reordering before summarization — that pushes ROUGE-L up another 4-6 points on multi-document inputs. We're integrating it into the next benchmark cycle.
Cost of ownership versus API alternatives
| Setup | Upfront cost | Cost per 200-page doc | Privacy | Latency |
|---|---|---|---|---|
| RTX 4090 + Qwen3 32B | $1,800 | $0.003 (electricity) | Full local | 60 s |
| RTX 4070 Ti + Gemma 3 12B | $900 | $0.002 | Full local | 40 s |
| Claude Sonnet 4.6 API | $0 | $0.31 | API processor | 15 s |
| GPT-5 API | $0 | $0.42 | API processor | 22 s |
Break-even for the RTX 4090 setup is around 5,800 documents — roughly 16 documents a day for a year. For high-volume regulated workloads (legal, healthcare, finance), the privacy story alone justifies local from day one. For French-language workflows, the parallel guide on quelllm.fr covers Mistral and Croissant-LLM specifically.
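The break-even arithmetic is easy to redo with your own prices. The sketch below assumes the only costs are the upfront hardware price and the per-document figures from the table, with the Claude Sonnet 4.6 row as the API baseline. It ignores depreciation, resale value, and your time, and the function name is illustrative.

```python
import math


def break_even_docs(upfront_cost: float, api_cost_per_doc: float,
                    local_cost_per_doc: float) -> int:
    """Documents needed before the local setup becomes cheaper than the API."""
    saving_per_doc = api_cost_per_doc - local_cost_per_doc
    if saving_per_doc <= 0:
        raise ValueError("local cost per document must be below the API cost")
    return math.ceil(upfront_cost / saving_per_doc)


# RTX 4090 + Qwen3 32B versus the Claude Sonnet 4.6 row in the table above.
docs = break_even_docs(upfront_cost=1800, api_cost_per_doc=0.31,
                       local_cost_per_doc=0.003)
docs_per_day = docs / 365
print(docs, round(docs_per_day, 1))
```

With the table's numbers this lands within about a percent of the figure quoted above, at roughly 16 documents a day over a year.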
Frequently Asked Questions
Can I really summarize a 400-page PDF on a single consumer GPU?
Yes, but not in a single pass on 24 GB. A 400-page document is roughly 180-220K tokens, which exceeds the usable context of every consumer-tier model we tested. Use map-reduce chunking (8K chunks with 400-token overlap), summarize each, then ask Qwen3 32B to synthesize a final summary from the chunk summaries. Total time: 4-7 minutes on an RTX 4090.
Is Qwen3 32B really better than Llama 3.3 70B for this?
For summarization specifically, the difference is 0.006 ROUGE-L (0.6 points) and 1.8 recall points in favor of Llama. Llama 3.3 70B does have lower hallucination rates (1.4% vs 2.1%), so for high-stakes legal or medical summaries it remains the safer choice if you have 48GB+ VRAM. For everything else, Qwen3 32B is the better value.
Why not just use Claude or GPT for summarization?
Three reasons: privacy (the document never leaves your machine), cost at volume (break-even around 5,800 documents), and reproducibility (your local model doesn't get silently updated). For one-off summaries of non-sensitive documents, the APIs are faster and require no upfront hardware spend.
Does KV-cache quantization hurt summary quality?
Barely. We measured a 0.7-point ROUGE-L drop using Q4 KV-cache versus FP16 on Qwen3 32B at 96K context. The memory savings (roughly 3.5×, since q4_0 stores about 4.5 bits per value against 16) are usually worth it. Avoid Q4 KV-cache on Gemma 3: its sliding-window attention is more sensitive and we saw a 2.3-point drop.
What about Mistral Large 2 or Command R+?
Both are excellent but require 64GB+ VRAM at usable quantizations, which puts them outside the consumer tier this guide targets. Command R+ in particular has strong long-context performance — if you have access to dual A6000s or an H100, it's competitive with Llama 3.3 70B.
Verdict
If you have 24 GB of VRAM, install Qwen3 32B at Q4_K_M today — it is the best local summarization model of 2026, full stop. If you have 16 GB, Gemma 3 12B at Q5_K_M will get you 90% of the quality at half the memory. If you have 8 GB, Qwen3 8B with map-reduce chunking is the only model worth your time at that tier. Skip Llama 3.3 70B unless hallucination control is critical and you have the hardware to run it cleanly.