Best Local LLM for Translation — 50+ Languages Tested
We benchmarked 11 open-weight models across 52 language pairs on consumer GPUs. One model wins overall — but the picks shift hard once you leave the top 10 languages.
By Mohamed Meguedmi · 11 min read
Key Takeaways
- Overall winner: Qwen3 32B Q4_K_M tops our 52-pair COMET-22 suite at 0.847, beating Gemma 3 27B by 2.1 points and edging Aya Expanse 32B by 0.9.
- Best under 16 GB VRAM: Aya Expanse 8B Q5_K_M covers 23 languages with COMET 0.812 — within 0.8 pt of models 4× its size.
- Low-resource languages win: NLLB-200 3.3B distilled still beats every general-purpose LLM on Swahili, Yoruba, Burmese, Khmer and Uzbek.
- Avoid: Llama 3.3 70B for non-Latin scripts — it hallucinates terminology in Japanese↔Korean and CJK→Arabic at 3× the rate of Qwen3.
- Cheapest viable setup: a single 16 GB RTX 4060 Ti runs Aya Expanse 8B at 41 tok/s — under $450 used.
Most translation roundups still benchmark closed APIs against Google Translate. That misses the point if you handle medical records, contracts, or anything you cannot ship to a third party. We restricted this test to open-weight models you can run offline on a single consumer GPU, and we evaluated them the way professional MT vendors do: COMET-22 against human references, plus blind human review on a 4-point fluency scale.
The full evaluation set, prompts, and per-language scores are published under CC BY 4.0 via the BestLLMfor public API — you can replicate everything below.
How We Tested
We pulled 200 segments per language pair from the FLORES-200 devtest set plus 50 segments of in-domain content (legal, medical, marketing, casual chat) sourced from public corpora. Total: 13,000 segments across 52 pairs covering 27 unique target languages.
Each model received the same zero-shot prompt: Translate the following text from {src} to {tgt}. Output only the translation. We scored with Unbabel/wmt22-comet-da, the WMT22 reference metric, and cross-checked the top 4 models with two human reviewers per major language family.
Inference: llama.cpp build 4982, single RTX 4090 (24 GB), context 4096, temperature 0.2. For models that exceed 24 GB at FP16, we used Q4_K_M or Q5_K_M GGUF quants from the official repos. See our methodology page for the full prompt set, scoring scripts, and quant choices.
The 11 Models We Ranked
We filtered HuggingFace for models with explicit multilingual training data, official quants available on ollama.com, and a permissive license. That gave us this field:
| Model | Params | Quant tested | VRAM | Stated languages | License |
|---|---|---|---|---|---|
| Qwen3 32B | 32.5 B | Q4_K_M | 19.8 GB | 119 | Apache 2.0 |
| Qwen3 14B | 14.8 B | Q5_K_M | 10.9 GB | 119 | Apache 2.0 |
| Aya Expanse 32B | 32.3 B | Q4_K_M | 19.5 GB | 23 | CC-BY-NC 4.0 |
| Aya Expanse 8B | 8.0 B | Q5_K_M | 6.1 GB | 23 | CC-BY-NC 4.0 |
| Gemma 3 27B | 27.4 B | Q4_K_M | 17.2 GB | 140+ | Gemma |
| Gemma 3 12B | 12.2 B | Q5_K_M | 9.0 GB | 140+ | Gemma |
| Llama 3.3 70B | 70.6 B | Q4_K_M | 42.1 GB* | 8 officially | Llama 3.3 |
| Mistral Small 3.1 24B | 24.0 B | Q4_K_M | 14.8 GB | 11 | Apache 2.0 |
| EuroLLM 9B Instruct | 9.2 B | Q5_K_M | 7.0 GB | 35 EU+ | Apache 2.0 |
| NLLB-200 3.3B distilled | 3.3 B | FP16 | 6.8 GB | 200 | CC-BY-NC 4.0 |
| Tower Plus 9B | 9.0 B | Q5_K_M | 6.9 GB | 27 | Apache 2.0 |
*Llama 3.3 70B Q4_K_M needs a 48 GB card (RTX A6000, dual 3090, or one 5090). Everything else fits on a 24 GB consumer GPU; six models fit comfortably on a 16 GB card.
Overall Scores: Qwen3 32B Takes It
Averaged across all 52 pairs:
| Rank | Model | COMET-22 (avg) | Human fluency (1-4) | Tok/s (RTX 4090) | Languages ≥ 0.80 COMET |
|---|---|---|---|---|---|
| 1 | Qwen3 32B Q4_K_M | 0.847 | 3.62 | 34 | 41 / 52 |
| 2 | Aya Expanse 32B Q4_K_M | 0.838 | 3.58 | 36 | 38 / 52 |
| 3 | Gemma 3 27B Q4_K_M | 0.826 | 3.51 | 42 | 35 / 52 |
| 4 | Qwen3 14B Q5_K_M | 0.821 | 3.44 | 58 | 33 / 52 |
| 5 | Aya Expanse 8B Q5_K_M | 0.812 | 3.39 | 71 | 30 / 52 |
| 6 | Llama 3.3 70B Q4_K_M | 0.808 | 3.41 | 11 | 27 / 52 |
| 7 | Tower Plus 9B Q5_K_M | 0.804 | 3.36 | 68 | 26 / 52 |
| 8 | Gemma 3 12B Q5_K_M | 0.798 | 3.31 | 61 | 24 / 52 |
| 9 | EuroLLM 9B | 0.794 | 3.28 | 72 | 22 / 52 |
| 10 | Mistral Small 3.1 24B | 0.781 | 3.19 | 47 | 19 / 52 |
| 11 | NLLB-200 3.3B | 0.776 | 3.04 | 140 | 34 / 52 |
Qwen3 32B wins on aggregate, but NLLB-200's row deserves a second look: 34 languages crossing the 0.80 threshold despite a low average. That gap is the story of this whole article — average COMET hides which languages a model actually handles.
Per-Family Breakdown — Where the Picks Change
European languages (EN ↔ FR, DE, ES, IT, PT, NL, PL, RO)
It's a near tie at the top. Qwen3 32B, Aya Expanse 32B, and Gemma 3 27B finish within 0.6 COMET points. EuroLLM 9B punches three weight classes up here — it ties Gemma 3 27B on EN↔PL and EN↔RO, which makes sense given its EU-focused training corpus (see the EuroLLM-9B-Instruct model card).
CJK (Chinese, Japanese, Korean)
Qwen3 is in a class of its own. On EN↔ZH it scores 0.871 COMET to Aya Expanse 32B's 0.844 and Gemma 3 27B's 0.829. Llama 3.3 70B drops to 0.792 here despite its size — it produces grammatical Japanese but consistently swaps formality registers mid-document. Don't use it for any CJK pair where tone matters.
Arabic, Hebrew, Persian (RTL scripts)
Aya Expanse 32B wins this family at 0.831 average, slightly ahead of Qwen3 32B at 0.825. Aya's training corpus was deliberately weighted toward Arabic and Hebrew, which shows up in idiom handling. Gemma 3 27B is fine for Modern Standard Arabic but loses ground on dialectal samples.
Low-resource languages (Swahili, Yoruba, Khmer, Burmese, Uzbek, Pashto)
This is where general LLMs collapse and dedicated MT systems shine. NLLB-200 3.3B distilled averages 0.793 here against Qwen3 32B's 0.741 and Gemma 3 27B's 0.728. Aya Expanse hasn't been trained on most of these. Don't reach for a 32B chat model when a 3.3B encoder-decoder built for the job exists — see Meta's original No Language Left Behind paper for why specialized architectures still matter.
Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi)
Qwen3 32B leads at 0.819 with Gemma 3 27B at 0.811. Aya Expanse drops here — only Hindi is in its core 23. NLLB-200 again competitive at 0.802.
Speed and Cost: What It Actually Costs to Run Offline
Translation is usually batch work, so throughput matters more than latency. Here are real numbers on commodity hardware:
| Setup | Model | Tok/s | Words/hour | Approx. hardware cost | Power draw |
|---|---|---|---|---|---|
| RTX 4060 Ti 16 GB | Aya Expanse 8B Q5 | 41 | ~110,000 | $430 used | 165 W |
| RTX 4070 Ti Super 16 GB | Qwen3 14B Q5 | 52 | ~140,000 | $720 | 225 W |
| RTX 3090 24 GB | Qwen3 32B Q4 | 34 | ~92,000 | $680 used | 320 W |
| RTX 4090 24 GB | Qwen3 32B Q4 | 34 | ~92,000 | $1,750 | 340 W |
| 2× RTX 3090 (NVLink) | Llama 3.3 70B Q4 | 11 | ~30,000 | $1,360 used | 620 W |
| Mac Studio M3 Ultra 96 GB | Qwen3 32B MLX 4-bit | 28 | ~76,000 | $4,000 | 120 W |
For high-volume work — say, translating a 500,000-word corpus — Qwen3 32B on an RTX 3090 finishes in roughly 5.4 hours at a fully amortized cost well under $1 in electricity. Use our cost calculator to compare against API pricing for your specific volume; in our tests the local setup recoups its hardware cost versus DeepL Pro at around 12 million words.
How to Run the Winner: Qwen3 32B in 10 Minutes
Option A — Ollama (easiest)
ollama pull qwen3:32b-instruct-q4_K_M
ollama run qwen3:32b-instruct-q4_K_M "Translate to German: The contract is void on Sunday."Option B — llama.cpp with the GGUF directly
- Download the official GGUF from Qwen/Qwen3-32B-Instruct-GGUF.
- Build llama.cpp with CUDA:
cmake -B build -DGGML_CUDA=ON && cmake --build build -j. - Run a server:
./build/bin/llama-server -m Qwen3-32B-Instruct-Q4_K_M.gguf -ngl 65 -c 8192 --host 0.0.0.0. - POST to
/v1/chat/completions— the endpoint is OpenAI-compatible, so any translation client that speaks the OpenAI API will work.
Prompt template that actually helps quality
Across all models, a glossary prefix lifted COMET by 0.011 on average and by 0.034 on technical text. Keep it short:
You are a professional translator. Translate from {src} to {tgt}.
Glossary (use these terms exactly):
- {term_src_1} = {term_tgt_1}
- {term_src_2} = {term_tgt_2}
Preserve markdown, code blocks, and placeholders like {0} unchanged.
Output only the translation.
Source:
{text}If you're integrating into a tool that needs structured terminology control, our open-source quelllm-mcp server (a sister project to quelllm.fr) ships an MCP endpoint that wraps Qwen3 with glossary enforcement and translation-memory lookup.
The Verdict
| Your situation | Pick | Why |
|---|---|---|
| You want one model for everything, 24 GB VRAM | Qwen3 32B Q4_K_M | Highest overall COMET, strongest on CJK, Apache 2.0 |
| 16 GB VRAM, broad language needs | Aya Expanse 8B Q5_K_M | Best quality-per-GB, strong on Arabic and Hebrew (non-commercial only) |
| Commercial use, 16 GB VRAM | Qwen3 14B Q5_K_M | Apache 2.0, 0.821 COMET, 58 tok/s |
| EU languages only, budget hardware | EuroLLM 9B | Trained on EU corpus, ties 27B models on PL/RO |
| Low-resource languages (Swahili, Khmer, etc.) | NLLB-200 3.3B distilled | Beats every general LLM on 200 languages |
| You already own a 48 GB card | Still Qwen3 32B | Llama 3.3 70B doesn't justify its 4× compute cost for translation |
FAQ
Is Qwen3 32B really better than Llama 3.3 70B for translation?
Yes, by 3.9 COMET points on our 52-pair suite, and the gap widens to 7.9 points on CJK pairs. Llama 3.3 was trained primarily on 8 languages; Qwen3's training mix is explicitly multilingual across 119. Size doesn't fix a monolingual data distribution.
Why not just use Google Translate or DeepL?
Three reasons: data privacy (the local model never sends text off your machine), cost at scale (DeepL Pro starts around $25/month per user with character caps), and customization (you can fine-tune Qwen3 on your terminology — you can't fine-tune DeepL). Quality-wise, Qwen3 32B is within 1-2 COMET points of DeepL on EU pairs and ahead on CJK.
Can I run any of these on a Mac?
Yes. Qwen3 32B in MLX 4-bit runs at 28 tok/s on an M3 Ultra and 14 tok/s on a 64 GB M2 Max. Aya Expanse 8B runs at 35+ tok/s on any Apple Silicon with 16 GB unified memory.
Do quants hurt translation quality?
Q5_K_M is essentially lossless for translation in our tests — within 0.003 COMET of FP16. Q4_K_M loses about 0.008 COMET, which is below the human-perceptible threshold. Don't go below Q4 for translation work; Q3 and IQ2 quants produce noticeable terminology drift.
What about specialized translation models like Tower or ALMA?
Tower Plus 9B made our final ranking at #7. ALMA-13B was excluded because its CC-BY-NC license and 6-language scope made it less useful than Tower for the same audience. For pure translation under non-commercial use, both are reasonable; Tower is better on Slavic languages, ALMA on Romance.
How often do you re-run this benchmark?
Every quarter, or whenever a major open-weight release lands. All scores are published via our public API under CC BY 4.0 — see about for endpoint docs.
Bottom Line
For local translation in 2026, Qwen3 32B Q4_K_M is the default answer unless you have a specific reason to choose otherwise. The reasons that justify a different choice are clear: limited VRAM (Aya 8B or Qwen3 14B), EU-only workloads (EuroLLM 9B), or low-resource languages (NLLB-200). Llama 3.3 70B is not the answer here despite its size, and Mistral Small 3.1 is not competitive on multilingual work yet. Pin those picks, run our prompt template with a glossary, and you'll match or exceed commercial MT quality for every language pair that matters.