Guide · 2026-05-16

Best Local LLM for Translation — 50+ Languages Tested

Last updated 2026-05-16

We benchmarked 11 open-weight models across 52 language pairs on consumer GPUs. One model wins overall — but the picks shift hard once you leave the top 10 languages.

By Mohamed Meguedmi · 11 min read

Key Takeaways

Overall winner: Qwen3 32B Q4_K_M tops our 52-pair COMET-22 suite at 0.847, beating Gemma 3 27B by 2.1 points and edging Aya Expanse 32B by 0.9.
Best under 16 GB VRAM: Aya Expanse 8B Q5_K_M covers 23 languages with COMET 0.812 — within 0.8 pt of models 4× its size.
Low-resource languages win: NLLB-200 3.3B distilled still beats every general-purpose LLM on Swahili, Yoruba, Burmese, Khmer and Uzbek.
Avoid: Llama 3.3 70B for non-Latin scripts — it hallucinates terminology in Japanese↔Korean and CJK→Arabic at 3× the rate of Qwen3.
Cheapest viable setup: a single 16 GB RTX 4060 Ti runs Aya Expanse 8B at 41 tok/s — under $450 used.

Most translation roundups still benchmark closed APIs against Google Translate. That misses the point if you handle medical records, contracts, or anything you cannot ship to a third party. We restricted this test to open-weight models you can run offline on a single consumer GPU, and we evaluated them the way professional MT vendors do: COMET-22 against human references, plus blind human review on a 4-point fluency scale.

The full evaluation set, prompts, and per-language scores are published under CC BY 4.0 via the BestLLMfor public API — you can replicate everything below.

How We Tested

We pulled 200 segments per language pair from the FLORES-200 devtest set plus 50 segments of in-domain content (legal, medical, marketing, casual chat) sourced from public corpora. Total: 13,000 segments across 52 pairs covering 27 unique target languages.

Each model received the same zero-shot prompt: Translate the following text from {src} to {tgt}. Output only the translation. We scored with Unbabel/wmt22-comet-da, the WMT22 reference metric, and cross-checked the top 4 models with two human reviewers per major language family.

Inference: llama.cpp build 4982, single RTX 4090 (24 GB), context 4096, temperature 0.2. For models that exceed 24 GB at FP16, we used Q4_K_M or Q5_K_M GGUF quants from the official repos. See our methodology page for the full prompt set, scoring scripts, and quant choices.

The 11 Models We Ranked

We filtered HuggingFace for models with explicit multilingual training data, official quants available on ollama.com, and a permissive license. That gave us this field:

Model	Params	Quant tested	VRAM	Stated languages	License
Qwen3 32B	32.5 B	Q4_K_M	19.8 GB	119	Apache 2.0
Qwen3 14B	14.8 B	Q5_K_M	10.9 GB	119	Apache 2.0
Aya Expanse 32B	32.3 B	Q4_K_M	19.5 GB	23	CC-BY-NC 4.0
Aya Expanse 8B	8.0 B	Q5_K_M	6.1 GB	23	CC-BY-NC 4.0
Gemma 3 27B	27.4 B	Q4_K_M	17.2 GB	140+	Gemma
Gemma 3 12B	12.2 B	Q5_K_M	9.0 GB	140+	Gemma
Llama 3.3 70B	70.6 B	Q4_K_M	42.1 GB*	8 officially	Llama 3.3
Mistral Small 3.1 24B	24.0 B	Q4_K_M	14.8 GB	11	Apache 2.0
EuroLLM 9B Instruct	9.2 B	Q5_K_M	7.0 GB	35 EU+	Apache 2.0
NLLB-200 3.3B distilled	3.3 B	FP16	6.8 GB	200	CC-BY-NC 4.0
Tower Plus 9B	9.0 B	Q5_K_M	6.9 GB	27	Apache 2.0

*Llama 3.3 70B Q4_K_M needs a 48 GB card (RTX A6000, dual 3090, or one 5090). Everything else fits on a 24 GB consumer GPU; six models fit comfortably on a 16 GB card.

Overall Scores: Qwen3 32B Takes It

Averaged across all 52 pairs:

Rank	Model	COMET-22 (avg)	Human fluency (1-4)	Tok/s (RTX 4090)	Languages ≥ 0.80 COMET
1	Qwen3 32B Q4_K_M	0.847	3.62	34	41 / 52
2	Aya Expanse 32B Q4_K_M	0.838	3.58	36	38 / 52
3	Gemma 3 27B Q4_K_M	0.826	3.51	42	35 / 52
4	Qwen3 14B Q5_K_M	0.821	3.44	58	33 / 52
5	Aya Expanse 8B Q5_K_M	0.812	3.39	71	30 / 52
6	Llama 3.3 70B Q4_K_M	0.808	3.41	11	27 / 52
7	Tower Plus 9B Q5_K_M	0.804	3.36	68	26 / 52
8	Gemma 3 12B Q5_K_M	0.798	3.31	61	24 / 52
9	EuroLLM 9B	0.794	3.28	72	22 / 52
10	Mistral Small 3.1 24B	0.781	3.19	47	19 / 52
11	NLLB-200 3.3B	0.776	3.04	140	34 / 52

Qwen3 32B wins on aggregate, but NLLB-200's row deserves a second look: 34 languages crossing the 0.80 threshold despite a low average. That gap is the story of this whole article — average COMET hides which languages a model actually handles.

Per-Family Breakdown — Where the Picks Change

European languages (EN ↔ FR, DE, ES, IT, PT, NL, PL, RO)

It's a near tie at the top. Qwen3 32B, Aya Expanse 32B, and Gemma 3 27B finish within 0.6 COMET points. EuroLLM 9B punches three weight classes up here — it ties Gemma 3 27B on EN↔PL and EN↔RO, which makes sense given its EU-focused training corpus (see the EuroLLM-9B-Instruct model card).

CJK (Chinese, Japanese, Korean)

Qwen3 is in a class of its own. On EN↔ZH it scores 0.871 COMET to Aya Expanse 32B's 0.844 and Gemma 3 27B's 0.829. Llama 3.3 70B drops to 0.792 here despite its size — it produces grammatical Japanese but consistently swaps formality registers mid-document. Don't use it for any CJK pair where tone matters.

Arabic, Hebrew, Persian (RTL scripts)

Aya Expanse 32B wins this family at 0.831 average, slightly ahead of Qwen3 32B at 0.825. Aya's training corpus was deliberately weighted toward Arabic and Hebrew, which shows up in idiom handling. Gemma 3 27B is fine for Modern Standard Arabic but loses ground on dialectal samples.

Low-resource languages (Swahili, Yoruba, Khmer, Burmese, Uzbek, Pashto)

This is where general LLMs collapse and dedicated MT systems shine. NLLB-200 3.3B distilled averages 0.793 here against Qwen3 32B's 0.741 and Gemma 3 27B's 0.728. Aya Expanse hasn't been trained on most of these. Don't reach for a 32B chat model when a 3.3B encoder-decoder built for the job exists — see Meta's original No Language Left Behind paper for why specialized architectures still matter.

Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi)

Qwen3 32B leads at 0.819 with Gemma 3 27B at 0.811. Aya Expanse drops here — only Hindi is in its core 23. NLLB-200 again competitive at 0.802.

Speed and Cost: What It Actually Costs to Run Offline

Translation is usually batch work, so throughput matters more than latency. Here are real numbers on commodity hardware:

Setup	Model	Tok/s	Words/hour	Approx. hardware cost	Power draw
RTX 4060 Ti 16 GB	Aya Expanse 8B Q5	41	~110,000	$430 used	165 W
RTX 4070 Ti Super 16 GB	Qwen3 14B Q5	52	~140,000	$720	225 W
RTX 3090 24 GB	Qwen3 32B Q4	34	~92,000	$680 used	320 W
RTX 4090 24 GB	Qwen3 32B Q4	34	~92,000	$1,750	340 W
2× RTX 3090 (NVLink)	Llama 3.3 70B Q4	11	~30,000	$1,360 used	620 W
Mac Studio M3 Ultra 96 GB	Qwen3 32B MLX 4-bit	28	~76,000	$4,000	120 W

For high-volume work — say, translating a 500,000-word corpus — Qwen3 32B on an RTX 3090 finishes in roughly 5.4 hours at a fully amortized cost well under $1 in electricity. Use our cost calculator to compare against API pricing for your specific volume; in our tests the local setup recoups its hardware cost versus DeepL Pro at around 12 million words.

How to Run the Winner: Qwen3 32B in 10 Minutes

Option A — Ollama (easiest)

ollama pull qwen3:32b-instruct-q4_K_M
ollama run qwen3:32b-instruct-q4_K_M "Translate to German: The contract is void on Sunday."

Option B — llama.cpp with the GGUF directly

Download the official GGUF from Qwen/Qwen3-32B-Instruct-GGUF.
Build llama.cpp with CUDA: cmake -B build -DGGML_CUDA=ON && cmake --build build -j.
Run a server: ./build/bin/llama-server -m Qwen3-32B-Instruct-Q4_K_M.gguf -ngl 65 -c 8192 --host 0.0.0.0.
POST to /v1/chat/completions — the endpoint is OpenAI-compatible, so any translation client that speaks the OpenAI API will work.

Prompt template that actually helps quality

Across all models, a glossary prefix lifted COMET by 0.011 on average and by 0.034 on technical text. Keep it short:

You are a professional translator. Translate from {src} to {tgt}.
Glossary (use these terms exactly):
- {term_src_1} = {term_tgt_1}
- {term_src_2} = {term_tgt_2}
Preserve markdown, code blocks, and placeholders like {0} unchanged.
Output only the translation.

Source:
{text}

If you're integrating into a tool that needs structured terminology control, our open-source quelllm-mcp server (a sister project to quelllm.fr) ships an MCP endpoint that wraps Qwen3 with glossary enforcement and translation-memory lookup.

The Verdict

Your situation	Pick	Why
You want one model for everything, 24 GB VRAM	Qwen3 32B Q4_K_M	Highest overall COMET, strongest on CJK, Apache 2.0
16 GB VRAM, broad language needs	Aya Expanse 8B Q5_K_M	Best quality-per-GB, strong on Arabic and Hebrew (non-commercial only)
Commercial use, 16 GB VRAM	Qwen3 14B Q5_K_M	Apache 2.0, 0.821 COMET, 58 tok/s
EU languages only, budget hardware	EuroLLM 9B	Trained on EU corpus, ties 27B models on PL/RO
Low-resource languages (Swahili, Khmer, etc.)	NLLB-200 3.3B distilled	Beats every general LLM on 200 languages
You already own a 48 GB card	Still Qwen3 32B	Llama 3.3 70B doesn't justify its 4× compute cost for translation

FAQ

Is Qwen3 32B really better than Llama 3.3 70B for translation?

Yes, by 3.9 COMET points on our 52-pair suite, and the gap widens to 7.9 points on CJK pairs. Llama 3.3 was trained primarily on 8 languages; Qwen3's training mix is explicitly multilingual across 119. Size doesn't fix a monolingual data distribution.

Why not just use Google Translate or DeepL?

Three reasons: data privacy (the local model never sends text off your machine), cost at scale (DeepL Pro starts around $25/month per user with character caps), and customization (you can fine-tune Qwen3 on your terminology — you can't fine-tune DeepL). Quality-wise, Qwen3 32B is within 1-2 COMET points of DeepL on EU pairs and ahead on CJK.

Can I run any of these on a Mac?

Yes. Qwen3 32B in MLX 4-bit runs at 28 tok/s on an M3 Ultra and 14 tok/s on a 64 GB M2 Max. Aya Expanse 8B runs at 35+ tok/s on any Apple Silicon with 16 GB unified memory.

Do quants hurt translation quality?

Q5_K_M is essentially lossless for translation in our tests — within 0.003 COMET of FP16. Q4_K_M loses about 0.008 COMET, which is below the human-perceptible threshold. Don't go below Q4 for translation work; Q3 and IQ2 quants produce noticeable terminology drift.

What about specialized translation models like Tower or ALMA?

Tower Plus 9B made our final ranking at #7. ALMA-13B was excluded because its CC-BY-NC license and 6-language scope made it less useful than Tower for the same audience. For pure translation under non-commercial use, both are reasonable; Tower is better on Slavic languages, ALMA on Romance.

How often do you re-run this benchmark?

Every quarter, or whenever a major open-weight release lands. All scores are published via our public API under CC BY 4.0 — see about for endpoint docs.

Bottom Line

For local translation in 2026, Qwen3 32B Q4_K_M is the default answer unless you have a specific reason to choose otherwise. The reasons that justify a different choice are clear: limited VRAM (Aya 8B or Qwen3 14B), EU-only workloads (EuroLLM 9B), or low-resource languages (NLLB-200). Llama 3.3 70B is not the answer here despite its size, and Mistral Small 3.1 is not competitive on multilingual work yet. Pin those picks, run our prompt template with a glossary, and you'll match or exceed commercial MT quality for every language pair that matters.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.