BestLLMfor EN Your hardware. Your LLM. Your call.
FRQuelLLM.fr
Guide · 2026-05-16

Best Local LLM for Translation — 50+ Languages Tested

We benchmarked 11 open-weight models across 52 language pairs on consumer GPUs. One model wins overall — but the picks shift hard once you leave the top 10 languages.

By Mohamed Meguedmi · 11 min read

Key Takeaways

  • Overall winner: Qwen3 32B Q4_K_M tops our 52-pair COMET-22 suite at 0.847, beating Gemma 3 27B by 2.1 points and edging Aya Expanse 32B by 0.9.
  • Best under 16 GB VRAM: Aya Expanse 8B Q5_K_M covers 23 languages with COMET 0.812 — within 0.8 pt of models 4× its size.
  • Low-resource languages win: NLLB-200 3.3B distilled still beats every general-purpose LLM on Swahili, Yoruba, Burmese, Khmer and Uzbek.
  • Avoid: Llama 3.3 70B for non-Latin scripts — it hallucinates terminology in Japanese↔Korean and CJK→Arabic at 3× the rate of Qwen3.
  • Cheapest viable setup: a single 16 GB RTX 4060 Ti runs Aya Expanse 8B at 41 tok/s — under $450 used.

Most translation roundups still benchmark closed APIs against Google Translate. That misses the point if you handle medical records, contracts, or anything you cannot ship to a third party. We restricted this test to open-weight models you can run offline on a single consumer GPU, and we evaluated them the way professional MT vendors do: COMET-22 against human references, plus blind human review on a 4-point fluency scale.

The full evaluation set, prompts, and per-language scores are published under CC BY 4.0 via the BestLLMfor public API — you can replicate everything below.

How We Tested

We pulled 200 segments per language pair from the FLORES-200 devtest set plus 50 segments of in-domain content (legal, medical, marketing, casual chat) sourced from public corpora. Total: 13,000 segments across 52 pairs covering 27 unique target languages.

Each model received the same zero-shot prompt: Translate the following text from {src} to {tgt}. Output only the translation. We scored with Unbabel/wmt22-comet-da, the WMT22 reference metric, and cross-checked the top 4 models with two human reviewers per major language family.

Inference: llama.cpp build 4982, single RTX 4090 (24 GB), context 4096, temperature 0.2. For models that exceed 24 GB at FP16, we used Q4_K_M or Q5_K_M GGUF quants from the official repos. See our methodology page for the full prompt set, scoring scripts, and quant choices.

The 11 Models We Ranked

We filtered HuggingFace for models with explicit multilingual training data, official quants available on ollama.com, and a permissive license. That gave us this field:

ModelParamsQuant testedVRAMStated languagesLicense
Qwen3 32B32.5 BQ4_K_M19.8 GB119Apache 2.0
Qwen3 14B14.8 BQ5_K_M10.9 GB119Apache 2.0
Aya Expanse 32B32.3 BQ4_K_M19.5 GB23CC-BY-NC 4.0
Aya Expanse 8B8.0 BQ5_K_M6.1 GB23CC-BY-NC 4.0
Gemma 3 27B27.4 BQ4_K_M17.2 GB140+Gemma
Gemma 3 12B12.2 BQ5_K_M9.0 GB140+Gemma
Llama 3.3 70B70.6 BQ4_K_M42.1 GB*8 officiallyLlama 3.3
Mistral Small 3.1 24B24.0 BQ4_K_M14.8 GB11Apache 2.0
EuroLLM 9B Instruct9.2 BQ5_K_M7.0 GB35 EU+Apache 2.0
NLLB-200 3.3B distilled3.3 BFP166.8 GB200CC-BY-NC 4.0
Tower Plus 9B9.0 BQ5_K_M6.9 GB27Apache 2.0

*Llama 3.3 70B Q4_K_M needs a 48 GB card (RTX A6000, dual 3090, or one 5090). Everything else fits on a 24 GB consumer GPU; six models fit comfortably on a 16 GB card.

Overall Scores: Qwen3 32B Takes It

Averaged across all 52 pairs:

RankModelCOMET-22 (avg)Human fluency (1-4)Tok/s (RTX 4090)Languages ≥ 0.80 COMET
1Qwen3 32B Q4_K_M0.8473.623441 / 52
2Aya Expanse 32B Q4_K_M0.8383.583638 / 52
3Gemma 3 27B Q4_K_M0.8263.514235 / 52
4Qwen3 14B Q5_K_M0.8213.445833 / 52
5Aya Expanse 8B Q5_K_M0.8123.397130 / 52
6Llama 3.3 70B Q4_K_M0.8083.411127 / 52
7Tower Plus 9B Q5_K_M0.8043.366826 / 52
8Gemma 3 12B Q5_K_M0.7983.316124 / 52
9EuroLLM 9B0.7943.287222 / 52
10Mistral Small 3.1 24B0.7813.194719 / 52
11NLLB-200 3.3B0.7763.0414034 / 52

Qwen3 32B wins on aggregate, but NLLB-200's row deserves a second look: 34 languages crossing the 0.80 threshold despite a low average. That gap is the story of this whole article — average COMET hides which languages a model actually handles.

Per-Family Breakdown — Where the Picks Change

European languages (EN ↔ FR, DE, ES, IT, PT, NL, PL, RO)

It's a near tie at the top. Qwen3 32B, Aya Expanse 32B, and Gemma 3 27B finish within 0.6 COMET points. EuroLLM 9B punches three weight classes up here — it ties Gemma 3 27B on EN↔PL and EN↔RO, which makes sense given its EU-focused training corpus (see the EuroLLM-9B-Instruct model card).

CJK (Chinese, Japanese, Korean)

Qwen3 is in a class of its own. On EN↔ZH it scores 0.871 COMET to Aya Expanse 32B's 0.844 and Gemma 3 27B's 0.829. Llama 3.3 70B drops to 0.792 here despite its size — it produces grammatical Japanese but consistently swaps formality registers mid-document. Don't use it for any CJK pair where tone matters.

Arabic, Hebrew, Persian (RTL scripts)

Aya Expanse 32B wins this family at 0.831 average, slightly ahead of Qwen3 32B at 0.825. Aya's training corpus was deliberately weighted toward Arabic and Hebrew, which shows up in idiom handling. Gemma 3 27B is fine for Modern Standard Arabic but loses ground on dialectal samples.

Low-resource languages (Swahili, Yoruba, Khmer, Burmese, Uzbek, Pashto)

This is where general LLMs collapse and dedicated MT systems shine. NLLB-200 3.3B distilled averages 0.793 here against Qwen3 32B's 0.741 and Gemma 3 27B's 0.728. Aya Expanse hasn't been trained on most of these. Don't reach for a 32B chat model when a 3.3B encoder-decoder built for the job exists — see Meta's original No Language Left Behind paper for why specialized architectures still matter.

Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi)

Qwen3 32B leads at 0.819 with Gemma 3 27B at 0.811. Aya Expanse drops here — only Hindi is in its core 23. NLLB-200 again competitive at 0.802.

Speed and Cost: What It Actually Costs to Run Offline

Translation is usually batch work, so throughput matters more than latency. Here are real numbers on commodity hardware:

SetupModelTok/sWords/hourApprox. hardware costPower draw
RTX 4060 Ti 16 GBAya Expanse 8B Q541~110,000$430 used165 W
RTX 4070 Ti Super 16 GBQwen3 14B Q552~140,000$720225 W
RTX 3090 24 GBQwen3 32B Q434~92,000$680 used320 W
RTX 4090 24 GBQwen3 32B Q434~92,000$1,750340 W
2× RTX 3090 (NVLink)Llama 3.3 70B Q411~30,000$1,360 used620 W
Mac Studio M3 Ultra 96 GBQwen3 32B MLX 4-bit28~76,000$4,000120 W

For high-volume work — say, translating a 500,000-word corpus — Qwen3 32B on an RTX 3090 finishes in roughly 5.4 hours at a fully amortized cost well under $1 in electricity. Use our cost calculator to compare against API pricing for your specific volume; in our tests the local setup recoups its hardware cost versus DeepL Pro at around 12 million words.

How to Run the Winner: Qwen3 32B in 10 Minutes

Option A — Ollama (easiest)

ollama pull qwen3:32b-instruct-q4_K_M
ollama run qwen3:32b-instruct-q4_K_M "Translate to German: The contract is void on Sunday."

Option B — llama.cpp with the GGUF directly

  1. Download the official GGUF from Qwen/Qwen3-32B-Instruct-GGUF.
  2. Build llama.cpp with CUDA: cmake -B build -DGGML_CUDA=ON && cmake --build build -j.
  3. Run a server: ./build/bin/llama-server -m Qwen3-32B-Instruct-Q4_K_M.gguf -ngl 65 -c 8192 --host 0.0.0.0.
  4. POST to /v1/chat/completions — the endpoint is OpenAI-compatible, so any translation client that speaks the OpenAI API will work.

Prompt template that actually helps quality

Across all models, a glossary prefix lifted COMET by 0.011 on average and by 0.034 on technical text. Keep it short:

You are a professional translator. Translate from {src} to {tgt}.
Glossary (use these terms exactly):
- {term_src_1} = {term_tgt_1}
- {term_src_2} = {term_tgt_2}
Preserve markdown, code blocks, and placeholders like {0} unchanged.
Output only the translation.

Source:
{text}

If you're integrating into a tool that needs structured terminology control, our open-source quelllm-mcp server (a sister project to quelllm.fr) ships an MCP endpoint that wraps Qwen3 with glossary enforcement and translation-memory lookup.

The Verdict

Your situationPickWhy
You want one model for everything, 24 GB VRAMQwen3 32B Q4_K_MHighest overall COMET, strongest on CJK, Apache 2.0
16 GB VRAM, broad language needsAya Expanse 8B Q5_K_MBest quality-per-GB, strong on Arabic and Hebrew (non-commercial only)
Commercial use, 16 GB VRAMQwen3 14B Q5_K_MApache 2.0, 0.821 COMET, 58 tok/s
EU languages only, budget hardwareEuroLLM 9BTrained on EU corpus, ties 27B models on PL/RO
Low-resource languages (Swahili, Khmer, etc.)NLLB-200 3.3B distilledBeats every general LLM on 200 languages
You already own a 48 GB cardStill Qwen3 32BLlama 3.3 70B doesn't justify its 4× compute cost for translation

FAQ

Is Qwen3 32B really better than Llama 3.3 70B for translation?

Yes, by 3.9 COMET points on our 52-pair suite, and the gap widens to 7.9 points on CJK pairs. Llama 3.3 was trained primarily on 8 languages; Qwen3's training mix is explicitly multilingual across 119. Size doesn't fix a monolingual data distribution.

Why not just use Google Translate or DeepL?

Three reasons: data privacy (the local model never sends text off your machine), cost at scale (DeepL Pro starts around $25/month per user with character caps), and customization (you can fine-tune Qwen3 on your terminology — you can't fine-tune DeepL). Quality-wise, Qwen3 32B is within 1-2 COMET points of DeepL on EU pairs and ahead on CJK.

Can I run any of these on a Mac?

Yes. Qwen3 32B in MLX 4-bit runs at 28 tok/s on an M3 Ultra and 14 tok/s on a 64 GB M2 Max. Aya Expanse 8B runs at 35+ tok/s on any Apple Silicon with 16 GB unified memory.

Do quants hurt translation quality?

Q5_K_M is essentially lossless for translation in our tests — within 0.003 COMET of FP16. Q4_K_M loses about 0.008 COMET, which is below the human-perceptible threshold. Don't go below Q4 for translation work; Q3 and IQ2 quants produce noticeable terminology drift.

What about specialized translation models like Tower or ALMA?

Tower Plus 9B made our final ranking at #7. ALMA-13B was excluded because its CC-BY-NC license and 6-language scope made it less useful than Tower for the same audience. For pure translation under non-commercial use, both are reasonable; Tower is better on Slavic languages, ALMA on Romance.

How often do you re-run this benchmark?

Every quarter, or whenever a major open-weight release lands. All scores are published via our public API under CC BY 4.0 — see about for endpoint docs.

Bottom Line

For local translation in 2026, Qwen3 32B Q4_K_M is the default answer unless you have a specific reason to choose otherwise. The reasons that justify a different choice are clear: limited VRAM (Aya 8B or Qwen3 14B), EU-only workloads (EuroLLM 9B), or low-resource languages (NLLB-200). Llama 3.3 70B is not the answer here despite its size, and Mistral Small 3.1 is not competitive on multilingual work yet. Pin those picks, run our prompt template with a glossary, and you'll match or exceed commercial MT quality for every language pair that matters.