Guide · 2026-05-22

Microsoft Phi-4 14B — The MIT-Licensed Reasoning Surprise

Microsoft's 14B parameter Phi-4 outperforms 70B models on math reasoning, runs on 8GB VRAM, and ships under MIT. Here's where it wins and where it breaks.

By Mohamed Meguedmi · 9 min read

Key Takeaways

  • 14B that beats 70B on math. Phi-4 scores 80.4 on MATH and 56.1 on MMLU-Pro, edging out Llama 3.3 70B Instruct on pure reasoning while being one-fifth the size.
  • MIT license, no asterisks. Commercial use, redistribution, derivatives — all permitted. This is the most permissive license of any frontier-grade reasoning model in 2026.
  • Runs on an 8 GB GPU. Q4_K_M quantization fits in 8.5 GB VRAM and pushes 35-45 tok/s on an RTX 4060. CPU-only inference is viable at 6-9 tok/s on DDR5.
  • Specialist, not generalist. Phi-4 is built for STEM, code, and structured reasoning. Creative writing, multilingual tasks, and long-context retrieval (16K cap) are weak spots.
  • Phi-4-reasoning-plus is the real upgrade. The reasoning-tuned variant hits o1-mini territory on AIME 2024/2025 — for free, locally.

Microsoft released Phi-4 in December 2024 with a quiet bombshell: a 14-billion-parameter model that outperforms Llama 3.3 70B on reasoning benchmarks while shipping under the MIT license. Eighteen months later, after the Phi-4-reasoning and Phi-4-reasoning-plus variants landed in April 2025, the model has become the default recommendation for anyone who needs serious math, code, and structured reasoning on local hardware. This review covers what the model actually delivers in mid-2026, where the marketing claims hold up, and where they don't.

What Phi-4 14B actually is

Phi-4 is the fourth generation of Microsoft Research's small language model family. The architecture is a 14B-parameter dense transformer with minimal changes from Phi-3 — what makes it work is the training data. Microsoft built a 9.8 trillion token corpus dominated by synthetic data: textbook-style reasoning chains, filtered web content, and curated academic Q&A. The Phi-4 technical report argues that data quality, not parameter count, is the bottleneck for reasoning capability, and the benchmarks back it up.

The base Phi-4 model is instruction-tuned and handles general chat. Two variants released in 2025 changed the picture:

  • Phi-4-reasoning — supervised fine-tuned on chain-of-thought traces distilled from o3-mini.
  • Phi-4-reasoning-plus — adds reinforcement learning on math, code, and logic problems. This is the variant that competes with o1-mini.

All three ship under MIT via the official microsoft/phi-4 model card on Hugging Face. Context window is 16,384 tokens — short by 2026 standards, and the single biggest functional limitation.

Benchmarks: where the 70B claim holds up

The headline claim — "14B beats 70B" — is true on specific benchmarks. It is not true across the board. The table below uses figures from the official Microsoft technical report, with Llama 3.3 70B Instruct and GPT-4o-mini numbers from their respective model cards as of May 2026.

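| Benchmark                      | Phi-4 14B | Phi-4-reasoning-plus | Llama 3.3 70B | GPT-4o-mini |
|--------------------------------|-----------|----------------------|---------------|-------------|
| MMLU (5-shot)                  | 84.8      | 85.7                 | 86.0          | 82.0        |
| MMLU-Pro                       | 56.1      | 76.0                 | 53.1          | 63.0        |
| MATH                           | 80.4      | 91.0                 | 71.9          | 70.2        |
| GPQA Diamond                   | 56.1      | 69.3                 | 49.1          | 40.9        |
| HumanEval                      | 82.6      | 83.8                 | 81.7          | 87.2        |
| AIME 2024                      | 16.7      | 78.0                 | 14.6          | 13.3        |
| IFEval (instruction following) | 63.0      | 59.9                 | 92.1          | 80.4        |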

Three things jump out. First, on pure reasoning (MATH, GPQA, AIME), Phi-4 and especially Phi-4-reasoning-plus dominate models 5x their size. The AIME 2024 jump from 16.7 to 78.0 is the largest single-benchmark improvement we've seen from RL fine-tuning on an open-weights model. Second, on instruction following (IFEval), Phi-4 is genuinely weak: Llama 3.3 70B is about 29 points ahead (92.1 vs 63.0), and Phi-4-reasoning-plus scores lower still at 59.9. Third, HumanEval coding is competitive but not class-leading; Qwen3-Coder 32B is still our pick for pure code generation.

Hardware requirements and quantization

Phi-4 is small enough to run on a wide range of hardware. The table below benchmarks bartowski's reference GGUF quantizations, tested with llama.cpp build b3891.

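| Quantization | File size | VRAM needed | Min GPU                     | Throughput (RTX 4090) | Quality loss vs FP16 |
|--------------|-----------|-------------|-----------------------------|-----------------------|----------------------|
| Q8_0         | 14.7 GB   | 16 GB       | RTX 4080                    | 78 tok/s              | <0.5%                |
| Q6_K         | 11.2 GB   | 13 GB       | RTX 4070 Ti                 | 92 tok/s              | ~1%                  |
| Q5_K_M       | 9.9 GB    | 11 GB       | RTX 4070                    | 105 tok/s             | ~2%                  |
| Q4_K_M       | 8.4 GB    | 9.5 GB      | RTX 4060 / 3060 12GB        | 118 tok/s             | ~3%                  |
| Q3_K_M       | 6.9 GB    | 8 GB        | RTX 3060 8GB / RTX 4060 8GB | 134 tok/s             | ~6%                  |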

Our recommendation is Q5_K_M for any GPU with 12 GB or more VRAM — the quality drop from FP16 is negligible on reasoning benchmarks. Below 12 GB, Q4_K_M is the sweet spot. Avoid Q3 and below for math-heavy workloads; we measured a 4-point drop on MATH-500 between Q5_K_M and Q3_K_M, which is enough to nullify Phi-4's reasoning advantage. For sizing other models on similar hardware, see our cost calculator.
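
The sizing advice can be written down as a small lookup. This is a minimal sketch, not a tool we ship: pick_quant is a hypothetical helper name, and the thresholds are simply the "VRAM needed" column from the table, capped at the Q5_K_M we recommend.

```shell
# Print the recommended Phi-4 quantization for a given amount of VRAM (GB).
# Thresholds are the "VRAM needed" figures from the quantization table.
# Larger quants (Q6_K, Q8_0) are left out on purpose: the recommendation
# above is to stop at Q5_K_M. pick_quant is a hypothetical helper name.
pick_quant() {
  awk -v vram="$1" 'BEGIN {
    n = split("Q5_K_M:11 Q4_K_M:9.5 Q3_K_M:8", rows, " ")
    for (i = 1; i <= n; i++) {
      split(rows[i], f, ":")
      # First row that fits wins, since rows are ordered best quality first.
      if (vram + 0 >= f[2] + 0) { print f[1]; exit 0 }
    }
    exit 1   # nothing fits entirely in VRAM
  }'
}
```

With these thresholds, pick_quant 12 prints Q5_K_M and pick_quant 10 prints Q4_K_M. An 8 GB card gets Q3_K_M, which is the quantization we advise against for math-heavy work, and anything smaller returns a non-zero status.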

CPU-only inference

Phi-4 is one of the few capable models that remains usable CPU-only. On a Ryzen 9 7950X with DDR5-6000, we measure 7.2 tok/s at Q4_K_M and 4.8 tok/s at Q6_K. That is slow for interactive chat but viable for batch reasoning, document analysis, or background agents. A Mac mini M4 Pro with 24 GB unified memory runs Q5_K_M at 22 tok/s — arguably the best price-per-token for Phi-4 in 2026.

How to run Phi-4 locally

The fastest path is Ollama. The official ollama.com/library/phi4 page hosts Q4_K_M by default, with Q8_0 and FP16 available as tags.

```shell
# Install and run base Phi-4
ollama pull phi4
ollama run phi4 "Prove that the sum of two odd integers is even."

# Reasoning-plus variant (recommended for math/code)
ollama pull phi4-reasoning:plus
ollama run phi4-reasoning:plus
```

For production deployments, llama.cpp with the --flash-attn flag and -ctk q8_0 -ctv q8_0 KV cache quantization will roughly halve KV cache memory at no measurable quality cost. The model weights themselves are unaffected, so total VRAM use drops by less than half. vLLM 0.9+ supports Phi-4 natively with continuous batching, which is the right choice for serving multiple concurrent users.
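
The effect of KV cache quantization can be estimated with simple arithmetic. The sketch below assumes the usual formula (two tensors, K and V, per layer, each holding KV heads x head dimension x context elements) and the q8_0 layout of 32 int8 values plus one 2-byte scale per block. The function name kv_cache_mib is hypothetical, and the Phi-4 shape used in the note after it (40 layers, 10 KV heads, head dimension 128) is our reading of the published model configuration, so check it against the config.json you actually download.

```shell
# Estimate KV cache size in MiB.
# usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS CACHE_TYPE
# CACHE_TYPE is f16 or q8_0. kv_cache_mib is a hypothetical helper name.
kv_cache_mib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v t="$5" 'BEGIN {
    # Bytes per cached element: f16 stores 2 bytes; q8_0 packs 32 int8
    # values plus one 2-byte scale into 34 bytes.
    if (t == "f16")       bpe = 2
    else if (t == "q8_0") bpe = 34 / 32
    else exit 1
    # The factor 2 covers the separate K and V tensors.
    printf "%d\n", 2 * l * h * d * c * bpe / 1048576
  }'
}
```

Under those assumptions, kv_cache_mib 40 10 128 16384 f16 prints 3200 and kv_cache_mib 40 10 128 16384 q8_0 prints 1700, so a full 16K context costs roughly 3.1 GiB of cache at f16 and about 1.7 GiB at q8_0.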

If you want to query Phi-4 results without running it yourself, the BestLLMfor public benchmark API (CC BY 4.0 licensed) exposes our full test corpus, including all Phi-4 quantization measurements, at api.bestllmfor.com/v1/models/phi-4. The quelllm-mcp open-source MCP server wraps the same data for Claude Desktop and other MCP clients.

Where Phi-4 breaks

Phi-4 is a specialist. Treating it as a general-purpose assistant exposes four real weaknesses:

  1. 16K context. This is fatal for long-document RAG, large codebases, or multi-turn agent workflows. Qwen3 32B (131K) and Llama 3.3 70B (128K) are categorically better here.
  2. Instruction following. The IFEval score of 63 means Phi-4 will frequently ignore formatting constraints, output JSON when asked for YAML, or skip steps in a structured prompt. Verbose system prompts help; they don't fix it.
  3. Multilingual performance. Phi-4 was trained primarily on English. French, German, and Mandarin outputs are functional but noticeably degraded compared to Qwen3. For non-English workflows, see quelllm.fr's French LLM guide.
  4. Safety refusals. Phi-4 is heavily aligned and will refuse benign red-team-adjacent questions ("how does buffer overflow work") more often than Llama or Qwen. Abliterated community variants exist but void the safety properties Microsoft documented.
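
For the context limit in particular, a cheap pre-flight check helps before sending a document to the model. The sketch below is a rough heuristic only: it assumes about 4 bytes per token, a common rule of thumb for English text rather than a property of Phi-4's tokenizer, and fits_context is a hypothetical helper name.

```shell
# Rough check of whether the text on stdin, plus a reserved output budget,
# fits in Phi-4's 16,384-token window.
# ASSUMPTION: ~4 bytes per token; a real tokenizer will give other counts.
# usage: fits_context [RESERVED_OUTPUT_TOKENS] < prompt.txt
fits_context() {
  reserve="${1:-2048}"          # tokens kept free for the model's reply
  bytes=$(wc -c | tr -d ' ')    # size of the prompt read from stdin
  est=$(( (bytes + 3) / 4 ))    # estimated prompt tokens, rounded up
  echo "estimated prompt tokens: $est"
  # Exit status 0 if prompt plus reserve fits, non-zero otherwise.
  [ $(( est + reserve )) -le 16384 ]
}
```

The reasoning variants produce long chains of thought, so a larger reserve than the 2048-token default is a sensible choice when using them.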

Phi-4 vs the 2026 competitive landscape

The small-model space has moved fast. Here's how Phi-4-reasoning-plus stacks up against the current alternatives at similar parameter counts, using our standard benchmark methodology.

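| Model                | Params | License    | Best for                   | Context | MATH | HumanEval |
|----------------------|--------|------------|----------------------------|---------|------|-----------|
| Phi-4-reasoning-plus | 14B    | MIT        | Math, science, logic       | 16K     | 91.0 | 83.8      |
| Qwen3 14B            | 14B    | Apache 2.0 | Multilingual, long context | 131K    | 82.1 | 80.5      |
| Qwen3-Coder 14B      | 14B    | Apache 2.0 | Code generation            | 262K    | 74.0 | 89.2      |
| Mistral Small 3.1    | 24B    | Apache 2.0 | General assistant          | 128K    | 69.5 | 78.4      |
| Gemma 3 12B          | 12B    | Gemma TOS  | Vision + chat              | 128K    | 71.0 | 74.5      |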

The verdict is clear: if your workload is reasoning-heavy and fits in 16K tokens, Phi-4-reasoning-plus is the best 14B model in 2026. For everything else — long context, multilingual, vision, general chat — Qwen3 or Mistral Small are better picks.

The MIT license: why it actually matters

Most "open" LLMs ship with restrictions. Llama 3.3 requires "Built with Llama" attribution, requires models trained on its outputs to carry "Llama" in their name, and requires a separate license from Meta for products with more than 700 million monthly active users. Gemma has a separate Google TOS. Mistral's research license blocks commercial use of some variants. Phi-4's MIT license has none of these: you can fine-tune it, sell the result, and embed it in proprietary products, with no obligation beyond keeping the copyright and license notice. For commercial deployments, this is a meaningful de-risking. See our licensing policy for how we score this in our rankings.

Verdict

Phi-4 14B and Phi-4-reasoning-plus offer the strongest reasoning-per-parameter ratio in the open-weights world as of May 2026. If you need o1-mini-grade math on a $400 GPU, this is the model. If you need a general assistant that follows complex instructions over long documents in multiple languages, look elsewhere.

| Use case                            | Recommendation           | Variant                     |
|-------------------------------------|--------------------------|-----------------------------|
| Math tutoring, scientific computing | Strong buy               | Phi-4-reasoning-plus Q5_K_M |
| Code review, algorithmic problems   | Buy                      | Phi-4-reasoning Q5_K_M      |
| Local agent with structured output  | Skip (use Qwen3 14B)     |                             |
| Long-document RAG (>16K)            | Skip (use Qwen3 14B)     |                             |
| Commercial product embedding        | Strong buy (MIT license) | Phi-4 base Q5_K_M           |
| Edge / 8GB GPU deployment           | Buy                      | Phi-4 Q4_K_M                |

Frequently Asked Questions

Is Phi-4 14B really better than Llama 3.3 70B?

On math, science, and structured reasoning benchmarks (MATH, GPQA, AIME), yes: in our benchmark table, Phi-4-reasoning-plus outperforms Llama 3.3 70B by roughly 19 to 63 points depending on the test. On instruction following, multilingual tasks, and long-context retrieval, Llama 3.3 70B remains substantially better. Choose based on workload, not headline benchmarks.

What hardware do I need to run Phi-4 14B?

Minimum: 8 GB VRAM GPU (RTX 3060 8GB, RTX 4060) running Q4_K_M, or 16 GB system RAM for CPU-only inference. Recommended: 12 GB VRAM (RTX 4070, RTX 3060 12GB) running Q5_K_M for the best quality-to-performance ratio. Mac users: 16 GB unified memory minimum, 24 GB recommended.

Can I use Phi-4 commercially?

Yes. Phi-4 ships under the MIT license, which permits commercial use, modification, redistribution, and embedding in proprietary products, with no requirement beyond keeping the copyright and license notice. This is more permissive than Llama, Gemma, or most Mistral variants.

Phi-4 vs Phi-4-reasoning vs Phi-4-reasoning-plus — which one?

For general use and lightweight chat, use base Phi-4. For math, code, and logic problems where you want chain-of-thought reasoning, use Phi-4-reasoning-plus. In our benchmark table the plus variant scores about 61 points higher than base Phi-4 on AIME 2024 (78.0 vs 16.7), but it generates 3-5x more tokens per response due to extended reasoning chains.

Why is Phi-4's context only 16K tokens?

Microsoft prioritized reasoning depth over context length in the training mix. Most synthetic training data was textbook-length, and extending context further degraded reasoning performance in their ablations. For long-context workloads, Qwen3 14B (131K) is the practical alternative at the same parameter count.

How does Phi-4 compare to GPT-4o-mini for cost?

GPT-4o-mini costs approximately $0.15 / $0.60 per million input/output tokens. Phi-4 running locally on a $400 GPU has zero per-token cost after hardware amortization. For workloads above ~50M tokens per month, local Phi-4 is cheaper. Use our cost calculator to model your specific usage.
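
The break-even point is easy to work out for your own volume. This is a back-of-the-envelope sketch that uses only the prices quoted above; it ignores electricity and assumes a single price per million tokens, and breakeven_months is a hypothetical helper name.

```shell
# Months until a one-off hardware purchase costs less than paying per token.
# usage: breakeven_months HARDWARE_USD MTOK_PER_MONTH USD_PER_MTOK
# Ignores electricity and assumes one flat price per million tokens.
breakeven_months() {
  awk -v hw="$1" -v mtok="$2" -v price="$3" 'BEGIN {
    monthly = mtok * price          # what the API would cost per month
    if (monthly <= 0) exit 1        # no API spend means no break-even
    printf "%.1f\n", hw / monthly
  }'
}
```

For a $400 GPU at 50M tokens per month, breakeven_months 400 50 0.60 prints 13.3 if every token is billed at the output price, and breakeven_months 400 50 0.15 prints 53.3 at the input price. Real workloads fall somewhere between the two.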