How we measure

Methodology — open by default

Everything you see on this site is sourced from a public protocol. Hardware, model versions, runtime config, and prompt sets are documented — so you can reproduce or challenge the numbers.

1. Catalog data — how models are indexed

Models are pulled from public sources — Hugging Face, Ollama library, vendor model cards — and reviewed daily by an automated pipeline.

Each model entry exposes: parameters count, quantization variants, license, VRAM requirement at common quants (Q4_K_M, Q5_K_M, Q8_0, FP16), context length, and one-line use case.

The full catalog is exposed under CC BY 4.0 at bestllmfor.com/api/models.json — free to reuse with attribution.

2. Benchmarks — tokens/second on real hardware

Two-tier measurement system:

Tier 1 — measured: models we ran ourselves through the benchmark pipeline at a documented quant and context length. Numbers in tokens/sec, with cold-start and warm-start variants.
Tier 2 — profiled: estimates derived from a "neighbor profile" (a similar architecture model we did run), normalized by parameter count and quant. Always labeled as estimate.

Prompt set is 5 standard prompts spanning code generation, summarization, French translation, JSON output, and long-context retrieval. Each prompt is run 3 times, median reported.

3. VRAM requirements — how we estimate

VRAM at a given quantization is computed from:

Model size (parameters × bits-per-weight at quant)
KV cache size (context length × layers × heads, fp16)
Runtime overhead (~600 MB for Ollama + llama.cpp baseline)

Formulas and constants are in the open-source MCP server repository, function estimate_vram(model_id, quant, context).

4. Cost calculator — pricing sources

API prices are pulled from vendor pricing pages and cross-checked semestrially. Currently indexed:

OpenAI — GPT-5, GPT-5 mini
Anthropic — Claude Opus 4.7, Sonnet 4.6, Haiku 4.5
Google — Gemini 2.5 Pro, Flash
DeepSeek — V3.5, R1
Mistral — Large 2.5, Small 3.5

Self-hosted cost is computed from hardware amortization (you set the months), electricity (you set the €/kWh or $/kWh), GPU TDP, and active hours/day. Output is the monthly break-even token volume.

5. Editorial process — rankings & recommendations

Rankings (e.g. "Best local LLM for coding on RTX 4090") are produced by a hybrid process:

Filter the catalog by hard constraints (fits in VRAM, license compatible)
Score remaining candidates against task-specific benchmarks (SWE-bench for coding, MMLU for reasoning, FLORES-200 for translation, etc.)
Manual review by the author with notes on quirks (regression bugs, prompt sensitivity, refusal rate)
Public source for every claim: model card, paper, or our own benchmark

No paid placement. No referral kickbacks. If a model leads a ranking, the numbers in the article say why.

6. Reproducing our numbers

Every benchmark page includes:

Model version + hash (e.g. qwen3:8b-q4_K_M-2026-04-12)
Runtime version + config (Ollama 0.x.x, num_ctx, num_thread)
Prompts used (full text in the page footer)
Date of measurement

If you spot a discrepancy with your own setup, open an issue. We update numbers, not stand-still rankings.