Best Local LLM for Fiction Writing — Tested on 50 Pages
We ran six open-weight models through a 50-page novella stress test. One pulled ahead on voice, pacing, and the dreaded chapter-12 amnesia.
By Mohamed Meguedmi · 11 min read
Key Takeaways
- Overall winner: Mistral-Large-Instruct-2411 Q4_K_M (123B). Best prose voice, lowest contradiction rate (1.2 per 10k tokens), needs 72 GB VRAM.
- Best on a 24 GB consumer GPU: Qwen3-32B-Instruct Q5_K_M. 14.3 tok/s on an RTX 4090, and the only sub-40B model to keep a side character's eye color consistent across 50 pages.
- Best for very long context (200k+): Command-R-Plus-08-2024 Q4_K_M. Outperforms larger models past 64k tokens thanks to grouped-query attention.
- Skip: Llama 3.3 70B for fiction. Strong reasoner, but its prose averages more than twice as many cliché phrases per 1k words as Mistral Large in our blind eval (6.8 vs. 3.1).
- Reality check: No local model yet matches Claude or GPT-4 for chapter-level narrative arc, but Mistral Large closes the gap at the sentence level.
Fiction is the hardest test we run. A model can ace MMLU and still write a love scene that reads like a microwave manual. To find which open-weight model is actually usable for novelists in 2026, the BestLLMfor editorial team commissioned a 50-page (≈ 22,000-word) novella from each of six leading local models, then graded the outputs blind against four criteria: voice consistency, long-range coherence, prose freshness, and dialogue believability. The methodology is published in full at /methodology/.
How we tested: the 50-page protocol
Each model received the same 1,800-word story bible: a near-future psychological thriller set in a Marseille shipyard, with seven named characters, three POV switches, and a non-linear timeline. Models were asked to draft 50 pages (≈ 440 words per page, for the ≈ 22,000-word total) across 12 chapters, with a fresh 4k-token instruction window per chapter and all prior chapters fed back into context. We logged generation speed, peak VRAM, and contradiction count (places where the model violated its own prior text), and ran a double-blind reader panel of three working novelists.
All inference used llama.cpp build b4123 with fixed-seed sampling for reproducibility: temperature=0.85, top_p=0.9, repeat_penalty=1.08, seed=42. Quants were pulled from official repos where available; see the Bartowski quant collection for the GGUF builds.
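For readers who want to reproduce the sampling settings, here is a minimal sketch that assembles a llama-cli command line from the values quoted above. The model and prompt file names are hypothetical placeholders, and the 32768 context size is an assumption borrowed from the how-to section rather than the exact value used in the test runs.

```python
import shlex

# Sampling settings quoted in the article, mapped to llama-cli flags.
SAMPLING = {
    "--temp": 0.85,
    "--top-p": 0.9,
    "--repeat-penalty": 1.08,
    "--seed": 42,
}


def build_llama_cli_args(model_path, prompt_file, ctx_size=32768):
    """Return an argv list for llama-cli using the article's sampling settings."""
    args = ["llama-cli", "-m", model_path, "-f", prompt_file, "-c", str(ctx_size)]
    for flag, value in SAMPLING.items():
        args += [flag, str(value)]
    return args


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    argv = build_llama_cli_args("model.gguf", "chapter_01_prompt.txt")
    print(shlex.join(argv))
```

Keeping the seed in the command makes a rerun on the same build and hardware comparable to the first run, which is what matters when counting contradictions across drafts.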
The contenders and their hardware costs
We selected six models that cover the realistic VRAM tiers a writer is likely to own — from a single 24 GB consumer card up to a dual-card setup. Cloud renters can plug the numbers into /tools/cost-calculator/ to compare against an A100 hourly rate.
| Model (Quant) | Params | File size | Min VRAM | Tokens/sec* | Context tested |
|---|---|---|---|---|---|
| Mistral-Large-Instruct-2411 Q4_K_M | 123B | 73.2 GB | 72 GB | 6.8 | 128k |
| Qwen3-32B-Instruct Q5_K_M | 32.5B | 23.1 GB | 24 GB | 14.3 | 128k |
| Command-R-Plus-08-2024 Q4_K_M | 104B | 62.0 GB | 64 GB | 7.4 | 200k |
| Llama-3.3-70B-Instruct Q4_K_M | 70B | 42.5 GB | 48 GB | 9.1 | 128k |
| Gemma-3-27B-it Q5_K_M | 27B | 19.4 GB | 24 GB | 16.7 | 128k |
| DeepSeek-V3-0324 Q3_K_M | 671B (MoE, 37B active) | 290 GB | 2× H100 80GB | 11.2 | 128k |
*Measured on a dual RTX 6000 Ada (96 GB combined) for 72 GB+ models, single RTX 4090 24 GB for smaller models. Prompt processing not included.
Verdict by category
Prose voice and freshness
Mistral Large dominated. Across a blind reading of 12 randomly-sampled paragraphs per model, the panel chose Mistral Large's prose 47% of the time, with Command R+ second at 22%. The model rarely reaches for the closest cliché — where Llama 3.3 wrote "her heart pounded in her chest," Mistral wrote "her pulse pressed against the inside of her wrist like something asking to be let out." Not every metaphor lands, but the attempt rate is meaningfully higher.
Cliché density per 1,000 words (panel-flagged, averaged across three readers):
| Model | Clichés / 1k words | Voice consistency (1-10) | Dialogue believability (1-10) |
|---|---|---|---|
| Mistral-Large 123B | 3.1 | 8.7 | 8.2 |
| Command-R-Plus 104B | 4.4 | 8.1 | 7.9 |
| Qwen3-32B | 5.2 | 7.8 | 7.4 |
| DeepSeek-V3 671B | 5.6 | 7.5 | 7.6 |
| Llama-3.3 70B | 6.8 | 7.1 | 6.8 |
| Gemma-3 27B | 7.4 | 6.4 | 6.1 |
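To compare models against the winner, the table's cliché densities can be turned into relative differences. The sketch below only does arithmetic on the figures in the table; the function name is ours.

```python
# Panel-flagged clichés per 1,000 words, copied from the table above.
CLICHES_PER_1K = {
    "Mistral-Large 123B": 3.1,
    "Command-R-Plus 104B": 4.4,
    "Qwen3-32B": 5.2,
    "DeepSeek-V3 671B": 5.6,
    "Llama-3.3 70B": 6.8,
    "Gemma-3 27B": 7.4,
}


def percent_more_than(baseline, model, table=CLICHES_PER_1K):
    """Percent by which `model` exceeds `baseline` in cliché density."""
    return (table[model] / table[baseline] - 1.0) * 100.0


if __name__ == "__main__":
    for name in CLICHES_PER_1K:
        if name != "Mistral-Large 123B":
            pct = percent_more_than("Mistral-Large 123B", name)
            print(f"{name}: {pct:.0f}% more clichés than Mistral Large")
```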
Long-range coherence
This is where the wheels usually come off. By chapter 8, Llama 3.3 had renamed a minor character twice and forgotten a key locked-door plot point established in chapter 2. Gemma-3 27B reintroduced a dead character on page 41 without comment. Mistral Large logged the lowest contradiction rate at 1.2 per 10k tokens; Command R+ was a close second at 1.6 and, notably, outperformed the larger models past the 64k-token mark, consistent with the architectural advantages described in the official Command R+ model card.
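The contradiction rate is a raw count normalized to 10,000 generated tokens, so drafts of different lengths can be compared. A minimal sketch of that normalization follows; the counts in the example are hypothetical, not the article's raw data.

```python
def contradictions_per_10k(contradiction_count, total_tokens):
    """Normalize a raw contradiction count to a rate per 10,000 tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return contradiction_count / total_tokens * 10_000


if __name__ == "__main__":
    # Hypothetical example: 6 flagged contradictions in a 50,000-token draft.
    print(f"{contradictions_per_10k(6, 50_000):.1f} per 10k tokens")
```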
Dialogue
DeepSeek-V3 surprised here. Its prose tends toward the workmanlike, but it writes the most distinct character voices — five of seven characters were identifiable from a single line of dialogue without attribution, against three of seven for Mistral Large. If you write dialogue-heavy fiction and have access to dual H100s (or a rental), it is worth a serious look.
The 24 GB-VRAM verdict: Qwen3-32B
Most readers won't have 72 GB of VRAM. On a single RTX 4090, RTX 5090, or RTX 6000 Ada, the practical winner is Qwen3-32B-Instruct Q5_K_M. It is the only sub-40B model in our test that kept a side character's stated eye color ("the gray of wet slate") consistent across all 12 chapters. It also handled the non-linear timeline without flattening it into chronological order — a failure mode Gemma-3 fell into by chapter 4. See the Qwen3-32B model card for the full architecture details.
If you want benchmark cross-references beyond our test, Lech Mazur's LLM Creative Story-Writing Benchmark uses pairwise judging on short-form prompts and broadly agrees with our long-form ranking at the top — though it cannot capture chapter-level coherence the way a 50-page test does.
How to run the winner locally
The following walks through running Mistral-Large-Instruct-2411 at Q4_K_M on a 72 GB-VRAM configuration via Ollama. Adjust the model tag for Qwen3-32B if you are on 24 GB.
- Install Ollama 0.5.7+: `curl -fsSL https://ollama.com/install.sh | sh`
- Pull the model: `ollama pull mistral-large:123b-instruct-2411-q4_K_M` (expect a ~73 GB download).
- Set the context window. Create a `Modelfile` with `PARAMETER num_ctx 32768` for novel-length sessions. 32k tokens covers roughly the last 7-8 chapters in working memory.
- Tune sampling for prose: `PARAMETER temperature 0.85`, `PARAMETER top_p 0.9`, `PARAMETER repeat_penalty 1.08`. Higher temps produce more surprise but degrade coherence past 0.95.
- Feed a story bible first. Always pin your character sheet and timeline at the start of context; do not rely on the model to extract them from prior chapters.
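The Modelfile settings from the steps above can be generated with a few lines of Python. This is a sketch, not an official template: the model tag is copied as given in the steps, and the helper name is ours.

```python
# Model tag as given in the steps above; swap in a smaller tag on a 24 GB card.
BASE_MODEL = "mistral-large:123b-instruct-2411-q4_K_M"

# Context and sampling parameters listed in the steps above.
PARAMETERS = {
    "num_ctx": 32768,
    "temperature": 0.85,
    "top_p": 0.9,
    "repeat_penalty": 1.08,
}


def render_modelfile(base_model=BASE_MODEL, parameters=PARAMETERS):
    """Render Modelfile text: one FROM line, then one PARAMETER line per setting."""
    lines = [f"FROM {base_model}"]
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modelfile(), end="")
```

Save the printed text as `Modelfile`, then register it with `ollama create novel-draft -f Modelfile`, where `novel-draft` is a hypothetical name of your choosing.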
For programmatic use, the open-source quelllm-mcp server (MIT) exposes any local Ollama or llama.cpp endpoint as a Model Context Protocol tool — useful if you write inside an MCP-aware editor and want chapter-aware retrieval over your draft. The BestLLMfor public API (CC BY 4.0) ships the same benchmark data shown above as JSON; see /about/ for credentials.
What local still can't do
Be honest with yourself: no open-weight model in May 2026 plans a 50-page arc as well as Claude Opus 4.6 or GPT-5. The gap is narrowing — Mistral Large now writes sentences that would have been frontier-only 18 months ago — but structural decisions (planting a clue in chapter 2 that pays off in chapter 11) still benefit from a human outline or a frontier model used as a planning consultant before handing pages to the local model for execution. French-speaking writers should note that quelllm.fr ran the equivalent test on French-language prose and found Mistral Large's lead is even larger in French, unsurprisingly.
Final verdict
| Use case | Pick | Why |
|---|---|---|
| Novella drafting, no VRAM ceiling | Mistral-Large-Instruct-2411 Q4_K_M | Best prose, lowest contradiction rate, 1.2 errors / 10k tokens. |
| Single 24-32 GB GPU (RTX 4090/5090) | Qwen3-32B-Instruct Q5_K_M | Only mid-size model that holds details across 50 pages. |
| Very long context (> 64k working tokens) | Command-R-Plus-08-2024 Q4_K_M | GQA architecture degrades gracefully past 64k. |
| Dialogue-heavy fiction, dual-H100 access | DeepSeek-V3-0324 Q3_K_M | Most distinct per-character voice in the panel. |
| Budget / 16 GB GPU | Gemma-3-27B-it Q5_K_M (with caveats) | Usable for scenes, not for full chapters; expect coherence failures. The 19.4 GB file exceeds 16 GB, so it needs partial CPU offload. |
FAQ
Can I write a full novel on a single RTX 4090?
Yes, using Qwen3-32B-Instruct at Q5_K_M with 32k context. Expect about 14 tokens/sec, or roughly 640 words per minute of generation at about 0.75 words per token. A 90,000-word novel takes roughly 2.3 hours of pure inference, plus your editing time.
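A back-of-envelope check of generation time can be sketched as follows. The 0.75 words-per-token ratio is an assumption, chosen to match the "32k tokens covers about 24,000 words" figure used elsewhere in this article; real ratios vary by tokenizer and prose style, and prompt processing is not included.

```python
# Assumed ratio; it matches the 32k tokens ~ 24,000 words figure in this article.
WORDS_PER_TOKEN = 0.75


def words_per_minute(tokens_per_sec, words_per_token=WORDS_PER_TOKEN):
    """Convert generation speed in tokens/sec to words/minute."""
    return tokens_per_sec * 60 * words_per_token


def hours_to_generate(word_count, tokens_per_sec, words_per_token=WORDS_PER_TOKEN):
    """Pure inference time in hours for a manuscript of `word_count` words."""
    return word_count / words_per_minute(tokens_per_sec, words_per_token) / 60


if __name__ == "__main__":
    speed = 14.3  # tok/s measured for Qwen3-32B on an RTX 4090
    print(f"{words_per_minute(speed):.0f} words/min")
    print(f"{hours_to_generate(90_000, speed):.1f} hours for 90,000 words")
```

Under these assumptions the measured 14.3 tok/s works out to about 640 words per minute and a little over two hours for a 90,000-word draft.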
Does fine-tuning beat prompting for fiction?
For voice imitation, yes — a QLoRA fine-tune of Qwen3-32B on 200k words of your prior work measurably improves voice consistency. For plot coherence, no; fine-tuning does not fix long-range memory limits.
Why isn't Llama 3.3 70B the top pick?
It is an excellent reasoner and instruction-follower, but its prose defaults to higher cliché density (6.8 per 1k words vs. Mistral Large's 3.1) and our panel ranked it 5th of 6 on voice consistency (7.1 out of 10). It remains a strong pick for technical writing.
Is a quantized model good enough for fiction?
Q4_K_M and Q5_K_M produce prose indistinguishable from FP16 in blind reading panels we have run previously. Drop below Q3 and contradiction rates climb sharply — DeepSeek-V3 at Q3_K_M is the absolute floor we would recommend.
How much context do I actually need?
32k tokens covers about 24,000 words — roughly the last 6-8 chapters of a typical novel. Beyond that, retrieve summaries of older chapters rather than dumping raw text; long-context performance degrades on all tested models past 64k working tokens, despite the advertised limits.
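One way to apply this advice is to pin the story bible, keep the most recent chapters as raw text, and fall back to summaries for older chapters once the budget runs low. The sketch below illustrates that idea under stated assumptions: token counts are estimated from word counts at an assumed 0.75 words per token rather than measured with a real tokenizer, the 4096-token reserve for instructions and output is our choice, and chapter summaries are assumed to exist already.

```python
def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate from word count (assumed ratio, not a tokenizer)."""
    return int(len(text.split()) / words_per_token) + 1


def pack_context(story_bible, chapters, summaries,
                 budget_tokens=32768, reserve_tokens=4096):
    """Pin the bible, keep the newest chapters raw, summarize older ones."""
    remaining = budget_tokens - reserve_tokens - estimate_tokens(story_bible)
    picked = []  # collected newest-first, reversed before returning
    use_raw = True
    for index in range(len(chapters) - 1, -1, -1):
        if use_raw and estimate_tokens(chapters[index]) <= remaining:
            piece = chapters[index]
        else:
            use_raw = False  # once one chapter is summarized, older ones are too
            piece = summaries[index]
            if estimate_tokens(piece) > remaining:
                break  # no room left, drop everything older
        remaining -= estimate_tokens(piece)
        picked.append(piece)
    return [story_bible] + picked[::-1]


if __name__ == "__main__":
    # Tiny synthetic example with a deliberately small budget.
    bible = "bible " * 10
    chapters = [f"c{i} " * 100 for i in range(3)]
    summaries = [f"s{i} " * 10 for i in range(3)]
    parts = pack_context(bible, chapters, summaries,
                         budget_tokens=200, reserve_tokens=0)
    print([part.split()[0] for part in parts])
```

The story bible always comes first, so the character sheet and timeline stay pinned at the start of context regardless of how many chapters fit.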
Methodology notes, the 1,800-word story bible, and all 22,000-word output samples are published under CC BY 4.0 via the BestLLMfor public API. See /methodology/ for the full grading rubric.