BestLLMfor EN Your hardware. Your LLM. Your call.
FRQuelLLM.fr
Guide · 2026-05-16

Best Local LLM for Document Summarization — Long Docs Tested

We summarized 200-page PDFs on consumer GPUs across nine open models. One wins on coherence, one wins on VRAM, and one wins on speed.

By Mohamed Meguedmi · 11 min read

Key Takeaways

  • Best overall (24GB VRAM): Qwen3 32B Q4_K_M at 128K context — highest ROUGE-L and the most coherent multi-section summaries in our 200-page test set.
  • Best on 16GB VRAM: Gemma 3 12B Q5_K_M — single-pass 64K, near-flagship coherence, 38 tok/s on an RTX 4070 Ti.
  • Best on 8GB VRAM: Qwen3 8B Q4_K_M with map-reduce chunking — beats every other small model on factual recall.
  • Skip: Llama 3.3 70B for summarization unless you have 48GB+ — quality gain over Qwen3 32B is under 3% on ROUGE-L and not worth the 4× cost per token.
  • Chunking still matters beyond 80K tokens — KV-cache attention degrades and recall drops 9-14 points on multi-document inputs.

Long-document summarization is the use case where local LLMs have finally caught up with frontier APIs. With Qwen3 32B and Gemma 3 shipping native 128K context windows, you can drop a full annual report, court filing, or arXiv preprint into a model running on a single consumer GPU and get an answer in under 60 seconds. The question is no longer can you do it — it's which model, at what quantization, on what hardware.

This guide is the result of running nine open-weight models against a benchmark of 24 long documents (median length 187 pages, longest 412 pages) sourced from SEC 10-Ks, EU regulatory filings, academic preprints, and technical manuals. We measured ROUGE-L against human-authored executive summaries, factual recall on a 12-question probe set, and end-to-end latency on three GPU tiers.

What "long document" actually means in 2026

The SERP is full of 2024-era advice that conflates token limits with usable context. A 128K context window on paper is not the same as 128K of effective attention. Three things matter:

  • Stated context length — what the tokenizer and position embeddings support (e.g. 128K, 256K, 1M).
  • Trained context length — how far the model was actually trained or fine-tuned with long sequences. Many models advertise 128K but were trained on 32K with RoPE scaling, which collapses recall past ~48K.
  • VRAM-bounded context — the actual window your GPU can hold once the KV cache is loaded.

For a 200-page PDF (roughly 90-110K tokens after extraction), only the third number matters. The table below shows what fits on each consumer tier:

GPUVRAMModel (Q4_K_M)Model sizeKV cache headroomMax usable context
RTX 306012 GBQwen3 8B4.9 GB~6.5 GB~48K tokens
RTX 4070 Ti16 GBGemma 3 12B7.8 GB~7.5 GB~64K tokens
RTX 4090 / 7900 XTX24 GBQwen3 32B18.4 GB~4.8 GB~96K tokens
RTX 4090 + offload24 GB + 64 GB RAMQwen3 32B18.4 GBCPU spillover128K tokens (slow)
2× RTX 309048 GBLlama 3.3 70B40 GB~6 GB~64K tokens

If you want the math behind this for your own GPU, our cost calculator includes a KV-cache estimator that takes head count, head dimension, and quantization into account.

The benchmark: 24 long documents, three difficulty tiers

We split the corpus into three tiers to stress different failure modes:

  • Tier A — single-document narrative (40-100 pages): SEC 10-Ks, white papers, court rulings. Tests coherence and structural fidelity.
  • Tier B — multi-document synthesis (8-15 source docs, ~80K tokens total): research literature reviews, due-diligence packets. Tests cross-document reasoning.
  • Tier C — extreme length (200-412 pages): regulatory filings, technical manuals, dissertations. Tests positional decay.

Each summary was scored on three axes: ROUGE-L F1 against a human-authored 600-word executive summary, factual recall (12 probe questions per doc, scored 0/1 by a Claude Sonnet 4.6 judge with the source as ground truth), and hallucination rate (claims unsupported by the source). Full methodology is on our methodology page.

Results across all tiers

ModelQuantVRAM usedROUGE-L (avg)Factual recallHallucinationTok/s (4090)
Qwen3 32BQ4_K_M18.4 GB0.41287.3%2.1%34
Llama 3.3 70BQ4_K_M40 GB0.41889.1%1.4%9 (2×3090)
Gemma 3 27BQ4_K_M16.8 GB0.39483.6%2.8%29
Gemma 3 12BQ5_K_M9.2 GB0.37178.4%3.6%38
Qwen3 14BQ4_K_M8.7 GB0.36879.2%3.9%42
Mistral Small 3.1 24BQ4_K_M14.1 GB0.35976.8%4.1%31
Qwen3 8BQ4_K_M4.9 GB0.34173.5%4.8%61
Llama 3.1 8BQ4_K_M4.7 GB0.29864.1%8.2%64
Phi-4 14BQ4_K_M8.4 GB0.31269.4%6.7%44

Two findings deserve attention. First, the gap between Qwen3 32B and Llama 3.3 70B is statistically insignificant on ROUGE-L (Δ 0.006) and only 1.8 points on recall — Llama wins on hallucination control, but the 4× memory cost is hard to justify. Second, Qwen3 punches above its weight at every size: the 8B model beats Llama 3.1 8B by 9 recall points and halves the hallucination rate.

The verdict by hardware tier

24 GB VRAM — Qwen3 32B Q4_K_M

This is the new default. With Qwen3 32B at Q4_K_M you get 18.4 GB of weights plus enough headroom for a 96K-token KV cache — that covers 90% of single-document jobs in one pass. For documents over 100K tokens, enable Q4 KV-cache quantization in llama.cpp (--cache-type-k q4_0 --cache-type-v q4_0) to push the usable window to 128K with negligible quality loss (we measured a 0.7-point ROUGE-L drop).

16 GB VRAM — Gemma 3 12B Q5_K_M

Gemma 3 12B at Q5_K_M is the sweet spot for a 4070 Ti or 4080. You get 64K of clean context, 38 tok/s, and ROUGE-L within 4 points of the 32B tier. The Gemma 3 12B model card documents the sliding-window attention pattern that keeps memory low — it's the architectural reason Gemma 3 outperforms Mistral Small 24B despite being half the size.

8 GB VRAM — Qwen3 8B with map-reduce

Below 12 GB you cannot fit a long document in a single context window — period. Use a map-reduce pipeline: chunk the document into 6-8K token segments with 400-token overlap, summarize each, then ask the model to synthesize a final summary from the chunk summaries. Qwen3 8B at Q4_K_M scored 0.358 ROUGE-L under this regime — within 5 points of the same model run in single-pass mode at 48K. The classic Reddit chunked-summarization recipe still works; Qwen3 8B is just a far better backbone than the models discussed in those 2024 threads.

Why most "128K context" claims fail past 80K

Positional decay is the dirty secret of long-context summarization. Even on models that genuinely train at 128K, retrieval accuracy drops past the 60-80K mark. We ran a needle-in-haystack probe (a single fabricated fact buried at varying depths) on the top three models:

DepthQwen3 32BGemma 3 27BLlama 3.3 70B
0-25%100%100%100%
25-50%98%96%99%
50-75%94%87%97%
75-100%89%71%94%

The lesson: for documents over 80K tokens, hierarchical summarization (chunk → mini-summary → final summary) outperforms single-pass even on models that technically support the full window. Llama 3.3 70B is the only model that holds up cleanly across the full 128K — but you're paying 4× the memory for that.

How to run the recommended stack

Single-pass 96K summarization on a 4090 (Qwen3 32B)

  1. Install Ollama 0.5.4 or newer from ollama.com.
  2. Pull the model: ollama pull qwen3:32b-q4_K_M (≈18.4 GB download).
  3. Set the context window: create a Modelfile with PARAMETER num_ctx 98304 and PARAMETER num_gpu 999 to keep all layers on GPU.
  4. Extract text from PDF with pdftotext or pymupdf4llm — keep page breaks as \n---\n markers; this measurably improves structural fidelity in the output.
  5. Prompt template: system message asking for an executive summary with section headings matching the document's table of contents, then user message containing the full text.
  6. Run: ollama run qwen3-long < document.txt — expect 45-90 seconds for a 200-page document at 34 tok/s.

For programmatic access, our BestLLMfor public benchmark API (CC BY 4.0) exposes all 24 benchmark documents and reference summaries — useful if you want to reproduce the scores on your own setup. The quelllm-mcp open-source server wraps Ollama with a Model Context Protocol interface that handles chunking and KV-cache management automatically.

What about RAG instead of long context?

For Q&A and lookup, RAG wins on cost and latency. For summarization, it loses badly. We compared single-pass Qwen3 32B at 96K context against a RAG pipeline (BGE-M3 embeddings, top-20 chunks, same model as generator): RAG scored 0.347 ROUGE-L versus 0.412 single-pass — a 16% drop. The reason is intuitive: summarization needs global coherence, and retrieval throws away structure. Use RAG when the user has a question; use long-context summarization when the user wants the gist.

The recent HERA paper (Feb 2025) proposes a middle path — context packaging and reordering before summarization — that pushes ROUGE-L up another 4-6 points on multi-document inputs. We're integrating it into the next benchmark cycle.

Cost of ownership versus API alternatives

SetupUpfront costCost per 200-page docPrivacyLatency
RTX 4090 + Qwen3 32B$1,800$0.003 (electricity)Full local60 s
RTX 4070 Ti + Gemma 3 12B$900$0.002Full local40 s
Claude Sonnet 4.6 API$0$0.31API processor15 s
GPT-5 API$0$0.42API processor22 s

Break-even for the RTX 4090 setup is around 5,800 documents — roughly 16 documents a day for a year. For high-volume regulated workloads (legal, healthcare, finance), the privacy story alone justifies local from day one. For French-language workflows, the parallel guide on quelllm.fr covers Mistral and Croissant-LLM specifically.

Frequently Asked Questions

Can I really summarize a 400-page PDF on a single consumer GPU?

Yes, but not in a single pass on 24 GB. A 400-page document is roughly 180-220K tokens, which exceeds the usable context of every consumer-tier model we tested. Use map-reduce chunking (8K chunks with 400-token overlap), summarize each, then ask Qwen3 32B to synthesize a final summary from the chunk summaries. Total time: 4-7 minutes on an RTX 4090.

Is Qwen3 32B really better than Llama 3.3 70B for this?

For summarization specifically, the difference is 0.006 ROUGE-L points and 1.8 recall points in favor of Llama. Llama 3.3 70B does have lower hallucination rates (1.4% vs 2.1%), so for high-stakes legal or medical summaries it remains the safer choice if you have 48GB+ VRAM. For everything else, Qwen3 32B is the better value.

Why not just use Claude or GPT for summarization?

Three reasons: privacy (the document never leaves your machine), cost at volume (break-even around 5,800 documents), and reproducibility (your local model doesn't get silently updated). For one-off summaries of non-sensitive documents, the APIs are faster and cheaper per-call.

Does KV-cache quantization hurt summary quality?

Barely. We measured a 0.7-point ROUGE-L drop using Q4 KV-cache versus FP16 on Qwen3 32B at 96K context. The memory savings (roughly 4×) are usually worth it. Avoid Q4 KV-cache on Gemma 3 — its sliding-window attention is more sensitive and we saw a 2.3-point drop.

What about Mistral Large 2 or Command R+?

Both are excellent but require 64GB+ VRAM at usable quantizations, which puts them outside the consumer tier this guide targets. Command R+ in particular has strong long-context performance — if you have access to dual A6000s or an H100, it's competitive with Llama 3.3 70B.

Verdict

If you have 24 GB of VRAM, install Qwen3 32B at Q4_K_M today — it is the best local summarization model of 2026, full stop. If you have 16 GB, Gemma 3 12B at Q5_K_M will get you 90% of the quality at half the memory. If you have 8 GB, Qwen3 8B with map-reduce chunking is the only model worth your time at that tier. Skip Llama 3.3 70B unless hallucination control is critical and you have the hardware to run it cleanly.