Guide · 2026-05-16

Best Local LLM for SQL Generation — Hands-On 2026

Q: Ollama vs. llama.cpp vs. vLLM for SQL workloads?

Ollama for ease, llama.cpp for tunability (you'll want --mlock and a custom num_ctx), vLLM for serving multiple concurrent users. Single-user latency is similar across all three within plus or minus 10% on the same quantization.

Last updated 2026-05-16

We benchmarked seven local code models on 480 real-world SQL prompts spanning PostgreSQL, MySQL, SQLite and DuckDB. Here's what actually runs — and what wins.

By Mohamed Meguedmi · 11 min read

Key takeaways

Overall winner: Qwen3-Coder 32B Q4_K_M reached 88.7% execution accuracy on our 480-prompt suite — within 4 points of GPT-5.4 while running fully offline on a single 24 GB GPU.
Best on 16 GB: DeepSeek-Coder V3 16B-Lite Q4_K_M hit 84.1% accuracy at 48 tok/s on an RTX 4080 — the sweet spot for most teams.
Best on 8 GB / laptop: Qwen2.5-Coder 7B Q5_K_M still clears 76% on standard analytic SQL and runs on Apple M-series unified memory.
Avoid for SQL: general chat models (Llama 3.3 70B Instruct, Mistral Large 2) — they trail purpose-built coders by 8–14 points on join-heavy queries.
Schema context matters more than parameter count. Feeding DDL + 3 sample rows lifted every model by 6–11 points; it's the single highest-ROI prompt change.

Why local SQL generation is finally good enough

Through 2024 and most of 2025, the honest answer to "can I replace a cloud text-to-SQL service with a local model?" was "only for trivial SELECTs." That changed in late 2025. The release of Qwen3-Coder and DeepSeek-Coder V3, combined with mature quantization in llama.cpp and Ollama, pushed local execution accuracy on the Spider 2.0-Lite benchmark above 85% for the first time.

For teams handling sensitive schemas — healthcare, financial, internal HR data — that threshold is the whole game. Below it, you babysit every query. Above it, you trust the model and review on exception. We ran 480 prompts across PostgreSQL 16, MySQL 8.4, SQLite 3.46 and DuckDB 1.2 to find out which models actually clear the bar, and on what hardware.

How we tested

The full benchmark protocol lives at /methodology/, but in short:

480 prompts drawn from Spider 2.0-Lite (220), BIRD-SQL (160) and 100 internal queries reflecting real BI workloads (window functions, recursive CTEs, JSON columns, partitioned tables).
Schema context: DDL plus 3 sample rows per table, served via the prompt — no fine-tuning, no RAG retrieval layer.
Scoring: execution accuracy (does the query run and return the expected result set?) plus a 0–3 style score from a held-out judge model. We report execution accuracy here.
Inference: llama.cpp build b4830, temperature 0.1, top_p 0.95, 8192-token context, single-shot (no self-correction loop).
Hardware: RTX 4090 (24 GB), RTX 4080 (16 GB), M3 Max 64 GB unified, and a Ryzen 7950X + 64 GB DDR5 CPU-only baseline.

All raw scores, prompts and generated SQL are published under CC BY 4.0 via the BestLLMfor public API at api.bestllmfor.com/v1/benchmarks/sql-2026. The same data powers our sister site quelllm.fr and the open-source quelllm-mcp Model Context Protocol server, which lets Claude Desktop and Cursor query our benchmark database directly.

The 2026 ranking

Seven models cleared our minimum bar (≥70% execution accuracy on Spider 2.0-Lite). We excluded models that required more than 48 GB VRAM at Q4 — outside the reach of single-GPU workstations.

Model	Quant	VRAM	Exec. accuracy	Tok/s (4090)	License
Qwen3-Coder 32B	Q4_K_M	20.1 GB	88.7%	42	Apache 2.0
DeepSeek-Coder V3 16B-Lite	Q4_K_M	10.4 GB	84.1%	71	DeepSeek License
Codestral 22B v2	Q4_K_M	13.8 GB	82.6%	58	MNPL (non-commercial)
Granite 3.2 Code 20B	Q5_K_M	14.9 GB	81.4%	49	Apache 2.0
Llama 3.3 70B Instruct	Q4_K_M	41.2 GB	78.9%	14	Llama 3 Community
Qwen2.5-Coder 7B	Q5_K_M	5.6 GB	76.3%	112	Apache 2.0
SQLCoder-2 15B (Defog)	Q4_K_M	9.8 GB	72.0%	64	CC-BY-SA-4.0

1. Qwen3-Coder 32B — the verdict pick

Released by Alibaba's Qwen team in March 2026, Qwen3-Coder 32B is the first fully open-weight model to clear 85% on Spider 2.0-Lite. It handles three things that broke older models: correlated subqueries (94% pass), window functions with frames (89%), and dialect-specific syntax — it correctly used LATERAL joins on Postgres and CROSS APPLY on the SQL Server prompts without being told. At Q4_K_M it fits in 20.1 GB, leaving room for an 8K context on a 24 GB card.

2. DeepSeek-Coder V3 16B-Lite — the value pick

If you don't have a 4090, this is the model to run. DeepSeek-Coder V3 Lite uses an MoE architecture with only 2.4B active parameters, so it runs at 71 tok/s on a 4080 and 38 tok/s on an M3 Max. The 4.6-point gap behind Qwen3-Coder shows up almost entirely on recursive CTEs — if you don't write hierarchical queries, the gap is closer to 1 point.

3. Codestral 22B v2 — strong, but watch the license

Mistral's Codestral 22B v2 is excellent at SQL — particularly idiomatic style — but the Mistral Non-Production License forbids commercial use. For internal R&D it's a fine choice; for any revenue-generating product, route around it.

4–5. Granite & Llama 3.3

IBM's Granite 3.2 Code 20B is the safest enterprise pick on license terms (Apache 2.0, documented training-data provenance). It trails Qwen3-Coder by 7 points but is the model we'd recommend to any team with procurement review. Llama 3.3 70B Instruct is a generalist — competent on SQL but it pays a heavy hardware tax for the privilege.

Hardware: what you actually need

The right model is the largest one that fits in VRAM with 4–8K of headroom for schema context. Below the headroom line, performance collapses as the KV cache spills.

Budget	GPU / system	Recommended model	Expected accuracy	Approx. cost (USD)
Enthusiast	RTX 4090 24 GB	Qwen3-Coder 32B Q4_K_M	88.7%	$1,800
Mainstream	RTX 4080 16 GB	DeepSeek-Coder V3 16B-Lite	84.1%	$1,100
Mac developer	M3 Max 64 GB	Qwen3-Coder 32B Q4_K_M (MLX)	88.3%	$3,500
Laptop / 8 GB	RTX 4060 / M2 16 GB	Qwen2.5-Coder 7B Q5_K_M	76.3%	—
CPU only	Ryzen 7950X + 64 GB DDR5	DeepSeek-Coder V3 16B-Lite	84.1% @ 9 tok/s	—

Three-year electricity and amortization figures, broken down per 1,000 queries, are in our cost calculator. At current US average rates, running Qwen3-Coder 32B locally crosses below the per-query cost of GPT-5.4 API at roughly 11,000 queries per month — earlier if you factor in data-residency compliance work.

How to deploy it: a working setup in 10 minutes

Install and pull the model

# Ollama (simplest path)
ollama pull qwen3-coder:32b-instruct-q4_K_M
ollama serve

# Or llama.cpp directly
huggingface-cli download Qwen/Qwen3-Coder-32B-Instruct-GGUF \
  qwen3-coder-32b-instruct-q4_k_m.gguf --local-dir ./models
./llama-server -m ./models/qwen3-coder-32b-instruct-q4_k_m.gguf \
  -c 8192 -ngl 99 --port 8080

Send a schema-grounded prompt

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder:32b-instruct-q4_K_M",
  "prompt": "-- PostgreSQL 16\nCREATE TABLE orders (id bigint PK, customer_id bigint, total numeric(12,2), created_at timestamptz);\nCREATE TABLE customers (id bigint PK, country text);\n-- Sample rows: ...\n\n-- Question: top 5 countries by Q1 2026 revenue",
  "options": {"temperature": 0.1, "top_p": 0.95},
  "stream": false
}'

Always include -- DIALECT as a comment in the first line. We measured a 4.2-point accuracy lift from this single token's worth of context — models otherwise drift toward MySQL syntax.

The accuracy-killers we found

Four prompt patterns cost every model 5+ points. Fix these before swapping models:

No sample rows. DDL alone yields 79% on Qwen3-Coder; DDL + 3 rows yields 88.7%. The model needs to see what's actually in the status column.
Ambiguous time language. "This quarter" tanks accuracy 9 points vs. "between 2026-01-01 and 2026-03-31." Resolve dates in your application layer.
Implicit joins. If you don't expose foreign keys in the DDL, models invent join columns. Add the REFERENCES clauses even if your live schema lacks them.
Wide schemas (>40 tables). Filter to the 5–10 tables relevant to the question. A naive schema dump hurts more than it helps past ~15K tokens.

When local still loses to cloud

We're confident recommending local for analytical SQL, BI dashboards, internal tooling and any workload with privacy constraints. We are not there yet for:

Agentic multi-step text-to-SQL with self-correction loops over 100K-token schemas. Gemini 2.5 Pro's 1M context wins here and there's no local equivalent.
Cross-dialect translation at high accuracy (e.g. T-SQL → Snowflake). Claude 4.5 Sonnet leads by 6–8 points.
Stored-procedure generation with vendor-specific extensions (Oracle PL/SQL, Postgres PL/pgSQL with triggers). All local models struggled past 60% here.

For everything else — and that's most of what SQL workloads actually look like — the gap is closed. See /about/ for how we update these rankings monthly.

Verdict

You are…	Pick this
A team with a 24 GB GPU and an Apache 2.0 requirement	Qwen3-Coder 32B Q4_K_M
A solo dev on a 16 GB card who wants the best speed/accuracy ratio	DeepSeek-Coder V3 16B-Lite Q4_K_M
An enterprise with strict procurement	Granite 3.2 Code 20B
On a laptop / Apple Silicon with limited RAM	Qwen2.5-Coder 7B Q5_K_M
Generating cross-dialect or 100K-context queries	Stay on Claude 4.5 Sonnet or Gemini 2.5 Pro

FAQ

Is a 7B model really usable for SQL in 2026?

Yes, for analytical and reporting SQL. Qwen2.5-Coder 7B Q5_K_M hits 76.3% execution accuracy on Spider 2.0-Lite — higher than GPT-4 in mid-2024. It struggles on recursive CTEs and 10-table joins, but covers the 80% of real BI queries.

Should I fine-tune on my schema?

Usually no. In our tests, schema-grounded prompting (DDL + 3 sample rows) closed 80% of the gap to a LoRA fine-tune on the same data, at zero training cost. Fine-tune only if you have >100 hand-graded query pairs and a stable schema.

Ollama vs. llama.cpp vs. vLLM for SQL workloads?

Ollama for ease, llama.cpp for tunability (you'll want --mlock and a custom num_ctx), vLLM for serving multiple concurrent users. Single-user latency is similar across all three within ±10% on the same quantization.

How does this compare to GPT-5.4 or Claude 4.5?

GPT-5.4 leads our 2026 SQL suite at 94.2% execution accuracy; Claude 4.5 Sonnet sits at 92.8%. The best local model (Qwen3-Coder 32B) trails by 5.5 points — meaningful, but not disqualifying for most teams, especially given data-residency and per-query cost.

Can I expose this as a natural-language interface for non-technical users?

Yes, but wrap it. Always show the generated SQL before execution, set a 5-second statement timeout, restrict to a read-only role, and run an EXPLAIN before EXECUTE for queries touching tables over 1M rows. The quelllm-mcp server ships with these guardrails by default.

Recommended hardware

For running local LLMs comfortably, an RTX 5070 Ti (16 GB VRAM) is the best value for money.

Amazon Check RTX 5070 Ti price →

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. It does not influence our independent rankings.