Methodology — open by default
Everything you see on this site is sourced from a public protocol. Hardware, model versions, runtime config, and prompt sets are documented — so you can reproduce or challenge the numbers.
1. Catalog data — how models are indexed
Models are pulled from public sources — Hugging Face, Ollama library, vendor model cards — and reviewed daily by an automated pipeline.
Each model entry exposes: parameter count, quantization variants, license, VRAM requirement at common quants (Q4_K_M, Q5_K_M, Q8_0, FP16), context length, and a one-line use case.
The full catalog is exposed under CC BY 4.0 at bestllmfor.com/api/models.json — free to reuse with attribution.
2. Benchmarks — tokens/second on real hardware
Two-tier measurement system:
- Tier 1 (measured): models we ran ourselves through the benchmark pipeline at a documented quant and context length. Numbers are reported in tokens/sec, with cold-start and warm-start variants.
- Tier 2 (profiled): estimates derived from a "neighbor profile" (a model with a similar architecture that we did run), normalized by parameter count and quant. Always labeled as an estimate.
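The exact normalization behind a Tier 2 estimate is not spelled out here, so the following is only a minimal sketch under one stated assumption: decoding speed scales inversely with the size of the weights in memory. The function names and the bits-per-weight table are illustrative, not the site's constants.

```python
# Approximate bits per weight for each quant (assumed values for illustration).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "FP16": 16.0}


def weight_bytes(params_billion: float, quant: str) -> float:
    """Size of the weights in bytes for a given parameter count and quant."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8


def estimate_tokens_per_sec(
    neighbor_tps: float,
    neighbor_params_billion: float,
    neighbor_quant: str,
    target_params_billion: float,
    target_quant: str,
) -> float:
    """Scale a measured neighbor throughput by the ratio of weight sizes.

    Assumption: throughput is inversely proportional to weight bytes.
    """
    ratio = weight_bytes(neighbor_params_billion, neighbor_quant) / weight_bytes(
        target_params_billion, target_quant
    )
    return neighbor_tps * ratio


# Hypothetical neighbor: an 8B model measured at 60 tokens/sec in Q4_K_M.
estimate = estimate_tokens_per_sec(60.0, 8, "Q4_K_M", 16, "Q4_K_M")
print(f"Estimated throughput: {estimate:.1f} tokens/sec (estimate, not measured)")
```

Under this assumption a model twice the size at the same quant gets half the neighbor's throughput, which is why such figures should stay labeled as estimates.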
The prompt set consists of five standard prompts covering code generation, summarization, French translation, JSON output, and long-context retrieval. Each prompt is run three times and the median is reported.
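As a rough illustration of the median-of-three rule, the sketch below computes tokens/sec for each run and reports the median. It assumes the runtime reports a generated-token count and a generation time in nanoseconds, as Ollama's generate response does with its eval_count and eval_duration fields. The helper names and the sample numbers are made up for the example.

```python
from statistics import median


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput for one run: generated tokens divided by seconds spent generating."""
    return eval_count / (eval_duration_ns / 1e9)


def median_throughput(runs: list[tuple[int, int]]) -> float:
    """Median tokens/sec over repeated runs of the same prompt.

    Each run is an (eval_count, eval_duration_ns) pair.
    """
    return median(tokens_per_second(count, duration) for count, duration in runs)


# Three illustrative runs of one prompt (not real measurements).
runs = [
    (256, 4_000_000_000),
    (256, 5_000_000_000),
    (256, 3_200_000_000),
]
print(f"Median throughput: {median_throughput(runs):.1f} tokens/sec")
```

Reporting the median rather than the mean keeps a single slow run (for example a cold start) from skewing the figure.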
3. VRAM requirements — how we estimate
VRAM at a given quantization is computed from:
- Model size (parameters × bits-per-weight at quant)
- KV cache size (2 × context length × layers × KV heads × head dimension × 2 bytes at fp16; the leading 2 covers keys and values)
- Runtime overhead (~600 MB for Ollama + llama.cpp baseline)
Formulas and constants are in the open-source MCP server repository, in the function estimate_vram(model_id, quant, context).
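The three terms above can be sketched as follows. This is not the repository's estimate_vram; it is a simplified stand-in that takes the model shape directly instead of a model_id. The bits-per-weight values are approximations and the ModelShape fields are hypothetical names.

```python
from dataclasses import dataclass

# Approximate bits per weight for each quant (assumed values for illustration).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "FP16": 16.0}

# Runtime overhead quoted in the text (~600 MB for Ollama + llama.cpp).
RUNTIME_OVERHEAD_BYTES = 600 * 1024**2

FP16_BYTES = 2


@dataclass
class ModelShape:
    """Hypothetical description of a model's dimensions."""

    params: float  # total parameter count
    layers: int  # number of transformer layers
    kv_heads: int  # number of key/value heads
    head_dim: int  # dimension of each head


def weights_bytes(shape: ModelShape, quant: str) -> float:
    """Parameters times bits per weight, converted to bytes."""
    return shape.params * BITS_PER_WEIGHT[quant] / 8


def kv_cache_bytes(shape: ModelShape, context: int) -> int:
    """Keys and values (factor 2) for every layer, KV head, and position, in fp16."""
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * context * FP16_BYTES


def estimate_vram_bytes(shape: ModelShape, quant: str, context: int) -> float:
    """Sum of weights, KV cache, and fixed runtime overhead."""
    return weights_bytes(shape, quant) + kv_cache_bytes(shape, context) + RUNTIME_OVERHEAD_BYTES


# Illustrative 8B-class shape; the numbers are assumptions, not catalog data.
shape = ModelShape(params=8e9, layers=32, kv_heads=8, head_dim=128)
total = estimate_vram_bytes(shape, "Q4_K_M", context=8192)
print(f"Estimated VRAM: {total / 1024**3:.2f} GiB")
```

The KV cache term grows linearly with context length, so the same model can fit at a short context and overflow at a long one.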
4. Cost calculator — pricing sources
API prices are pulled from vendor pricing pages and cross-checked every six months. Currently indexed:
- OpenAI — GPT-5, GPT-5 mini
- Anthropic — Claude Opus 4.7, Sonnet 4.6, Haiku 4.5
- Google — Gemini 2.5 Pro, Flash
- DeepSeek — V3.5, R1
- Mistral — Large 2.5, Small 3.5
Self-hosted cost is computed from hardware amortization (you set the months), electricity (you set the €/kWh or $/kWh), GPU TDP, and active hours/day. Output is the monthly break-even token volume.
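A minimal sketch of that break-even arithmetic, assuming a 30-day month, the GPU drawing its full TDP during active hours, and a single blended API price per million tokens. The function names and all input figures are illustrative, not the calculator's actual code or real prices.

```python
def monthly_self_hosted_cost(
    hardware_price: float,
    amortization_months: int,
    gpu_tdp_watts: float,
    active_hours_per_day: float,
    price_per_kwh: float,
    days_per_month: int = 30,
) -> float:
    """Monthly hardware amortization plus monthly electricity cost."""
    amortization = hardware_price / amortization_months
    energy_kwh = gpu_tdp_watts / 1000 * active_hours_per_day * days_per_month
    return amortization + energy_kwh * price_per_kwh


def break_even_tokens(monthly_cost: float, api_price_per_million_tokens: float) -> float:
    """Monthly token volume at which API spend equals the self-hosted cost."""
    return monthly_cost / api_price_per_million_tokens * 1_000_000


# Illustrative inputs only.
cost = monthly_self_hosted_cost(
    hardware_price=1800.0,
    amortization_months=36,
    gpu_tdp_watts=450.0,
    active_hours_per_day=8.0,
    price_per_kwh=0.25,
)
tokens = break_even_tokens(cost, api_price_per_million_tokens=10.0)
print(f"Monthly cost: {cost:.2f}, break-even volume: {tokens:,.0f} tokens/month")
```

Below the break-even volume the API is cheaper; above it, self-hosting is, under these simplifying assumptions.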
5. Editorial process — rankings & recommendations
Rankings (e.g. "Best local LLM for coding on RTX 4090") are produced by a hybrid process:
- Filter the catalog by hard constraints (fits in VRAM, license compatible)
- Score remaining candidates against task-specific benchmarks (SWE-bench for coding, MMLU for reasoning, FLORES-200 for translation, etc.)
- Manual review by the author with notes on quirks (regression bugs, prompt sensitivity, refusal rate)
- Public source for every claim: model card, paper, or our own benchmark
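The automated part of this process (the hard-constraint filter and the benchmark scoring) can be sketched as below. The Candidate fields, model names, and scores are hypothetical placeholders. The manual review and source-checking steps are editorial and are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical catalog entry reduced to what the ranking needs."""

    name: str
    vram_gb: float  # estimated VRAM at the chosen quant
    license: str
    task_score: float  # score on the task-specific benchmark


def rank(
    candidates: list[Candidate],
    vram_budget_gb: float,
    allowed_licenses: set[str],
) -> list[Candidate]:
    """Drop candidates that fail a hard constraint, then sort by score, best first."""
    eligible = [
        c
        for c in candidates
        if c.vram_gb <= vram_budget_gb and c.license in allowed_licenses
    ]
    return sorted(eligible, key=lambda c: c.task_score, reverse=True)


# Placeholder data for illustration only.
catalog = [
    Candidate("model-a", vram_gb=18.0, license="apache-2.0", task_score=41.0),
    Candidate("model-b", vram_gb=30.0, license="apache-2.0", task_score=55.0),
    Candidate("model-c", vram_gb=20.0, license="research-only", task_score=48.0),
    Candidate("model-d", vram_gb=22.0, license="mit", task_score=44.5),
]
for entry in rank(catalog, vram_budget_gb=24.0, allowed_licenses={"apache-2.0", "mit"}):
    print(entry.name, entry.task_score)
```

Filtering before scoring means a high-scoring model that does not fit the hardware or the license never reaches the shortlist that goes to manual review.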
No paid placement. No referral kickbacks. If a model leads a ranking, the numbers in the article say why.
6. Reproducing our numbers
Every benchmark page includes:
- Model version + hash (e.g. qwen3:8b-q4_K_M-2026-04-12)
- Runtime version + config (Ollama 0.x.x, num_ctx, num_thread)
- Prompts used (full text in the page footer)
- Date of measurement
If you spot a discrepancy with your own setup, open an issue. We update the numbers rather than let rankings stand still.