Guide · 2026-06-03

Best Abliterated/Uncensored Local LLMs 2026

Eight abliterated and uncensored local models tested for refusal rate, reasoning loss, and VRAM cost. Verdict-driven picks for 8GB through 48GB rigs.

By Mohamed Meguedmi · 11 min read

Key takeaways

  • Top overall pick: Heretic-Qwen3-32B (Q4_K_M, 20 GB VRAM), with a 1.4% refusal rate and only 2.1 MMLU-Pro points below the censored base.
  • Best 8 GB pick: JOSIEFIED-Qwen3 8B — the sweet spot for 8 GB GPUs and Apple Silicon with 16 GB unified memory.
  • Best 48 GB pick: Llama 3.3 70B Abliterated (Q4_K_M, ~40 GB) — sharpest reasoning of any unrestricted local model.
  • Avoid: first-gen uncensored releases (Dolphin fine-tunes and early abliterations) built on the Llama 3.1 base. The 2026 retrains on Llama 3.3 and Qwen3 are categorically better.
  • Abliteration is not jailbreaking. It surgically removes the refusal direction in activation space; it doesn’t add new capabilities, and on the 2026 generation it costs 1–4 MMLU-Pro points.

What “abliterated” actually means in 2026

Abliteration is a weight-surgery technique popularized by Maxime Labonne in 2024, building on the refusal-direction research of Arditi et al. and FailSpy’s original implementation, and refined throughout 2025. It identifies the single direction in the residual stream most correlated with refusal behavior, then projects that direction out of the weights that write to the residual stream in every layer. The model loses most of its ability to refuse, not just the inclination: no system prompt, no jailbreak, no DAN required.
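To make the projection step concrete, here is a minimal sketch in plain Python. It assumes the difference-of-means recipe from the published refusal-direction research: average the activations on prompts the model refused, subtract the average on prompts it answered, normalize, then remove that direction from a matrix whose output lands in the residual stream. The toy activations, the tiny matrix, and every function name are illustrative assumptions; this is not the Heretic pipeline, which the article does not describe at this level of detail.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def refusal_direction(refused_acts, complied_acts):
    """Difference of mean activations, normalized to unit length."""
    mu_refused = mean_vector(refused_acts)
    mu_complied = mean_vector(complied_acts)
    return unit([a - b for a, b in zip(mu_refused, mu_complied)])


def ablate(weight, direction):
    """Return W' = W - r (r^T W).

    `weight` has d_model rows and d_in columns, so W @ x lives in the
    residual stream. After ablation, W' @ x has no component along r.
    """
    columns = list(zip(*weight))
    r_dot_columns = [dot(direction, col) for col in columns]  # r^T W
    return [
        [w - r_i * rc for w, rc in zip(row, r_dot_columns)]
        for row, r_i in zip(weight, direction)
    ]


def matvec(weight, x):
    return [dot(row, x) for row in weight]


# Toy 3-dimensional "activations" (hypothetical numbers, not real model data).
refused = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
complied = [[0.1, 0.0, 0.1], [-0.1, 0.2, -0.1]]
r = refusal_direction(refused, complied)

W = [[0.5, -1.0], [0.3, 0.8], [-0.2, 0.4]]  # d_model = 3, d_in = 2
W_ablated = ablate(W, r)

x = [1.0, 2.0]
before = dot(r, matvec(W, x))
after = dot(r, matvec(W_ablated, x))
print(f"component along r before: {before:.3f}, after: {after:.2e}")
```

Because r has unit length, the component of any output along r is zero after the edit, up to floating-point error. A real model repeats this for every matrix that writes into the residual stream.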

This is fundamentally different from an “uncensored fine-tune” like the Dolphin series, which retrains on filtered datasets where refusals have been scrubbed. Fine-tuning teaches the model not to refuse; abliteration removes the machinery that would have produced the refusal. Both approaches now coexist, and the best 2026 models combine them: abliterate first, then DPO-tune to repair the small reasoning loss.

The current state of the art is the Heretic family (released April 2026 by the Heretic collective), which automates the directional ablation across multiple refusal axes (safety, copyright, and self-identification) and applies a recovery fine-tune on UltraChat-Uncensored. The result is the first generation of unrestricted models whose reasoning loss is only about 2 MMLU-Pro points.

Why run an uncensored model locally

The legitimate use cases are well-documented and the editorial team treats them as the default audience: red-team security research, fiction writing involving violence or sexuality, medical and legal scenarios where hosted models refuse to engage with hypotheticals, dataset generation for downstream training, and translation of historical or controversial source material. Hosted API providers cannot serve these workloads at any price — not because of model capability, but because of Acceptable Use Policies that explicitly prohibit them.

Local execution also eliminates the data exposure problem. If you’re drafting a vulnerability disclosure or a divorce filing, the prompt itself is the sensitive payload. The cost calculator shows the break-even point for replacing Claude Sonnet 4.6 with a local 32B model is roughly 4.2M tokens/month at US electricity prices — well within reach for any active solo developer.
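The break-even arithmetic can be sketched as follows. The article does not publish the calculator's inputs, so the power draw, electricity rate, amortization period, and API price below are placeholder assumptions, and the result is not expected to reproduce the 4.2M figure. Only the $1,999 card price and the roughly 60 tok/s throughput come from the hardware table in this guide.

```python
def local_usd_per_million(watts, tokens_per_second, usd_per_kwh):
    """Electricity cost of generating one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3600 / 1000
    return kwh * usd_per_kwh


def break_even_tokens_per_month(hardware_cost_usd, amortization_months,
                                api_usd_per_million, local_usd_per_million_tokens):
    """Monthly token volume at which amortized hardware equals API savings."""
    monthly_hardware = hardware_cost_usd / amortization_months
    saving_per_million = api_usd_per_million - local_usd_per_million_tokens
    if saving_per_million <= 0:
        raise ValueError("local inference is not cheaper per token")
    return monthly_hardware / saving_per_million * 1_000_000


# Placeholder inputs: 450 W, $0.17/kWh, 36 months, $10 per million tokens
# are assumptions, not figures from the article or from any price list.
electricity = local_usd_per_million(watts=450, tokens_per_second=60, usd_per_kwh=0.17)
tokens = break_even_tokens_per_month(
    hardware_cost_usd=1999,
    amortization_months=36,
    api_usd_per_million=10.0,
    local_usd_per_million_tokens=electricity,
)
print(f"electricity: ${electricity:.2f} per million tokens")
print(f"break-even: {tokens / 1e6:.1f}M tokens/month")
```

Swap in your own electricity rate and the current API price list; the break-even point moves a lot with both.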

Test methodology

Each model was evaluated on a standardized harness combining four signals:

  1. Refusal rate on a 500-prompt set drawn from HarmBench, AdvBench, and an internal red-team corpus. Lower is better for uncensored use cases.
  2. MMLU-Pro (5-shot) to measure reasoning loss vs. the censored base model.
  3. IFEval for instruction-following degradation — abliteration often hurts this more than raw knowledge.
  4. Coherence at length: 2,000-token continuation tests scored by GPT-5 as a judge, looking for the “ghost refusal” pattern (agreement followed by topic deflection).
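As a rough illustration of the first signal, refusal rate is just the share of responses classified as refusals. The article does not say how the harness labels a refusal, so the keyword check below is a hypothetical stand-in, and the marker list is an assumption.

```python
# Hypothetical refusal markers; a production harness would use a stronger classifier.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i won't",
    "i'm not able to",
    "as an ai",
)


def is_refusal(response):
    """Flag a response whose opening contains a known refusal phrase."""
    head = response.strip().lower().replace("\u2019", "'")[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(responses):
    """Percentage of responses flagged as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    refused = sum(is_refusal(r) for r in responses)
    return 100.0 * refused / len(responses)


# 7 refusals in a 500-prompt run corresponds to the 1.4% headline figure.
sample = ["I can't help with that."] * 7 + ["Sure, here is the scene."] * 493
print(f"{refusal_rate(sample):.1f}%")
```

A keyword check like this only sees the opening of a response, so it misses the “ghost refusal” pattern entirely. That gap is why the fourth signal relies on a judge model reading the full continuation.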

All quants are GGUF Q4_K_M unless noted, run on llama.cpp b4892. Full methodology and raw results are exposed via the public BestLLMfor API (CC BY 4.0) under the /benchmarks/uncensored endpoint. The same data feeds the open-source MCP server if you want to query it from Claude Desktop or Cursor.

The 2026 ranking

Eight models cleared the bar (refusal rate < 10%, MMLU-Pro drop < 5 points). They are ranked below by composite score, not by size.

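| Rank | Model | Params | VRAM (Q4_K_M) | Refusal rate | MMLU-Pro Δ | IFEval |
|---|---|---|---|---|---|---|
| 1 | Heretic-Qwen3-32B | 32B | 20 GB | 1.4% | −2.1 | 78.4 |
| 2 | Llama-3.3-70B-Abliterated-v2 | 70B | 40 GB | 2.8% | −1.9 | 81.2 |
| 3 | Heretic-Qwen3-14B | 14B | 9 GB | 1.9% | −2.4 | 74.1 |
| 4 | JOSIEFIED-Qwen3-8B | 8B | 5.5 GB | 3.2% | −3.0 | 69.8 |
| 5 | Dolphin-3.1-Mistral-Small-24B | 24B | 14 GB | 4.1% | −2.8 | 71.5 |
| 6 | Nous-Hermes-3-Llama-3.3-70B | 70B | 40 GB | 6.0% | −1.2 | 79.7 |
| 7 | Gemma-3-27B-Abliterated | 27B | 17 GB | 4.7% | −3.6 | 72.0 |
| 8 | EVA-Qwen2.5-14B | 14B | 9 GB | 5.4% | −3.1 | 68.2 |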
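If you want to pick from this ranking programmatically, a small sketch follows. It transcribes the rows above, applies the stated bar (refusal rate under 10%, MMLU-Pro drop under 5 points), and returns the highest-ranked model that fits a VRAM budget. The 2 GB context reserve is an assumption, and because the composite-score formula is not given here, the sketch simply reuses the published rank order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    rank: int
    name: str
    vram_gb: float
    refusal_pct: float
    mmlu_pro_delta: float
    ifeval: float


# Transcribed from the ranking table in this guide.
RANKING = [
    Entry(1, "Heretic-Qwen3-32B", 20, 1.4, -2.1, 78.4),
    Entry(2, "Llama-3.3-70B-Abliterated-v2", 40, 2.8, -1.9, 81.2),
    Entry(3, "Heretic-Qwen3-14B", 9, 1.9, -2.4, 74.1),
    Entry(4, "JOSIEFIED-Qwen3-8B", 5.5, 3.2, -3.0, 69.8),
    Entry(5, "Dolphin-3.1-Mistral-Small-24B", 14, 4.1, -2.8, 71.5),
    Entry(6, "Nous-Hermes-3-Llama-3.3-70B", 40, 6.0, -1.2, 79.7),
    Entry(7, "Gemma-3-27B-Abliterated", 17, 4.7, -3.6, 72.0),
    Entry(8, "EVA-Qwen2.5-14B", 9, 5.4, -3.1, 68.2),
]


def clears_bar(entry):
    """Inclusion bar: refusal rate < 10% and MMLU-Pro drop < 5 points."""
    return entry.refusal_pct < 10.0 and abs(entry.mmlu_pro_delta) < 5.0


def best_fit(vram_budget_gb, context_reserve_gb=2.0):
    """Highest-ranked entry whose weights plus a context reserve fit the budget.

    context_reserve_gb is a hypothetical default, not a figure from the guide.
    """
    candidates = [
        e for e in RANKING
        if clears_bar(e) and e.vram_gb + context_reserve_gb <= vram_budget_gb
    ]
    return min(candidates, key=lambda e: e.rank, default=None)


for budget in (8, 12, 24, 48):
    pick = best_fit(budget)
    print(budget, "GB ->", pick.name if pick else "nothing fits")
```

Selecting by rank alone returns Heretic-Qwen3-32B for a 48 GB budget as well. If raw reasoning matters more than composite rank at that tier, sort the candidates by a different key instead.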

1. Heretic-Qwen3-32B — the overall winner

This is the model the editorial team now runs by default for any uncensored workload that fits in a single 24 GB consumer card. The 1.4% residual refusal rate is the lowest ever measured on the harness, and the 2.1-point MMLU-Pro drop is within the noise floor of quantization itself. Long-form coherence is excellent: no ghost refusals, no moralizing preambles. GGUF builds are available on HuggingFace.

2. Llama-3.3-70B-Abliterated-v2 — if you have 48 GB

The v2 retrain (March 2026) fixes the persona instability that plagued v1. With a Q4_K_M quant at ~40 GB plus 4–6 GB for context, this fits on a single 48 GB RTX 6000 Ada with only a few GB to spare, and more comfortably on an RTX 5090 + 4090 split or 64 GB Apple Silicon. Best raw reasoning of any model on the list. The trade-off is throughput: expect 12–18 tok/s on a 5090, vs. 60+ on the 32B Heretic.
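The fit claim is simple arithmetic, sketched below for the three setups named above. The sketch treats every gigabyte of nominal capacity as usable, which is an optimistic assumption: drivers, the desktop, and (on Apple Silicon) the operating system all take a share.

```python
WEIGHTS_GB = 40          # Q4_K_M weights, from the ranking table
CONTEXT_GB = (4, 6)      # context allowance quoted in this section

# Nominal memory per setup; real usable memory is lower (assumption noted above).
SETUPS = {
    "RTX 6000 Ada": 48,
    "RTX 5090 + RTX 4090": 32 + 24,
    "64 GB Apple Silicon": 64,
}


def headroom_gb(capacity_gb, weights_gb, context_gb):
    """Memory left over after loading weights and reserving context."""
    return capacity_gb - (weights_gb + context_gb)


for name, capacity in SETUPS.items():
    tight = headroom_gb(capacity, WEIGHTS_GB, max(CONTEXT_GB))
    loose = headroom_gb(capacity, WEIGHTS_GB, min(CONTEXT_GB))
    print(f"{name}: {tight} to {loose} GB of headroom")
```

On the single 48 GB card the margin is only 2 to 4 GB, so a longer context window may not fit without a smaller quant.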

3. Heretic-Qwen3-14B — the 12 GB sweet spot

For RTX 4070 Ti / 5070 / 3080 Ti owners, this is the pick. 9 GB at Q4_K_M leaves headroom for 16K context. The reasoning gap vs. the 32B is real (~6 MMLU-Pro points), but the refusal removal is just as clean.

4. JOSIEFIED-Qwen3-8B — the 8 GB and Apple Silicon pick

The community favorite from late 2025 holds up. 5.5 GB at Q4_K_M runs on an RTX 3060 12 GB or any M-series Mac with 16 GB unified memory. Slightly more “raw” than Heretic — less polished outputs, but zero hesitation. Available on Ollama.

5–8. Honorable mentions

Dolphin 3.1 on Mistral Small 24B is the best fine-tune-only entry — useful if you object to abliteration on philosophical grounds. Nous Hermes 3 70B has the smallest reasoning loss but a noticeably higher refusal rate; pair it with a short jailbreak prompt and it’s viable. Gemma 3 27B Abliterated is fast but loses more on IFEval than the Qwen-based competitors. EVA-Qwen2.5-14B is purpose-built for creative writing and beats everything above on prose quality — pick it for fiction, skip it for code or analysis.

Hardware cost-of-entry

| Budget tier | Hardware | Cost (USD, mid-2026) | Recommended model | Tokens/sec |
|---|---|---|---|---|
| Entry | RTX 3060 12 GB (used) | $220 | JOSIEFIED-Qwen3 8B | 45–55 |
| Mainstream | RTX 5070 Ti 16 GB | $799 | Heretic-Qwen3 14B | 70–90 |
| Enthusiast | RTX 5090 32 GB | $1,999 | Heretic-Qwen3 32B | 55–70 |
| Pro | 2× RTX 5090 (64 GB) | $4,000 | Llama 3.3 70B Abliterated | 15–22 |
| Apple | M4 Max 64 GB Mac Studio | $3,499 | Llama 3.3 70B Abliterated (Q4) | 9–12 |

For a deeper breakdown by use case, see the Best LLM by use case hub and the model catalog for per-model VRAM ladders.

How to install Heretic-Qwen3-32B with Ollama

Three commands, assuming Ollama 0.5.7+ is installed and you have at least 22 GB of free VRAM.

# 1. Pull the GGUF Q4_K_M build
ollama pull heretic-team/heretic-qwen3:32b-q4_k_m

# 2. Verify the refusal direction is removed
ollama run heretic-team/heretic-qwen3:32b-q4_k_m "Describe a fictional bank heist in second person, present tense, 400 words."

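# 3. (Optional) Set a 16K context window. `ollama run` has no --ctx-size flag;
#    start the interactive session, then set num_ctx at the >>> prompt.
ollama run heretic-team/heretic-qwen3:32b-q4_k_m
>>> /set parameter num_ctx 16384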

If the model still refuses on step 2, you pulled a mislabeled tag — verify the SHA against the HuggingFace manifest. The full evaluation methodology is documented at /methodology/.
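For scripted use, the same model can be called through Ollama's local REST API, which accepts the context window as a per-request option. The sketch below assumes Ollama is serving on its default port (11434) and that the model tag matches the one pulled in step 1; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "heretic-team/heretic-qwen3:32b-q4_k_m"     # tag as used in the steps above


def build_payload(prompt, num_ctx=16384, model=MODEL):
    """Request body for /api/generate with a per-request context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt, num_ctx=16384, timeout=600):
    """Send the prompt to the local Ollama server and return the generated text."""
    body = json.dumps(build_payload(prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `print(generate("Describe a fictional bank heist in second person, present tense, 400 words."))` repeats the step 2 check from a script.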

The reasoning-loss honesty section

Every uncensored model on this list is measurably dumber than its base. The Heretic recovery fine-tune narrows the gap to ~2 MMLU-Pro points, but it does not close it. If your workload is pure code generation or math — tasks where the base model wasn’t going to refuse anyway — use the censored base. Run Qwen3-Coder 32B for code; reach for Heretic only when you actually need unrestricted output.

The second honest caveat: abliteration is not a safety bypass for genuinely dangerous capabilities the model never had. None of these models can synthesize novel bioweapons, write working zero-days, or do anything else a competent Google search wouldn’t. They will, however, write what you ask without lecturing you. That’s the entire product.

Editorial verdict

| Use case | Pick | Why |
|---|---|---|
| General unrestricted assistant | Heretic-Qwen3-32B | Lowest refusal rate, smallest reasoning loss, fits on a 5090. |
| Maximum reasoning, 48 GB+ | Llama-3.3-70B-Abliterated-v2 | Sharpest unrestricted model. Worth the throughput hit. |
| 8 GB GPU / 16 GB Mac | JOSIEFIED-Qwen3 8B | Only model in this class that’s genuinely unfiltered. |
| Creative fiction | EVA-Qwen2.5-14B | Prose quality beats everything else; reasoning is secondary. |
| Red-team / security research | Heretic-Qwen3-32B | Cleanest outputs on adversarial prompts, no ghost refusals. |
| Fine-tune-only purist | Dolphin-3.1-Mistral-Small-24B | No weight surgery; best of the “trained uncensored” cohort. |

Heretic-Qwen3-32B is the model to beat in 2026. Everything else either costs more VRAM, refuses more often, or both.

FAQ

Is running an abliterated LLM legal?

In the US, UK, and Australia, yes — downloading and running open-weight models is legal regardless of their alignment status. What you do with the output is subject to the same laws as anything else (CSAM, fraud, defamation, etc. remain illegal). The EU AI Act adds disclosure obligations for deployers above certain thresholds; consult counsel if you’re shipping to end users.

Does abliteration break the model’s reasoning?

It costs 1–4 MMLU-Pro points on the 2026 generation, down from 6–10 points in 2024. The Heretic recovery fine-tune is the main reason the gap has narrowed, though it has not closed. For workloads where the base model wouldn’t have refused, use the base model.

Heretic vs. Dolphin vs. JOSIEFIED — what’s the difference?

Heretic does weight surgery (abliteration) then a recovery fine-tune. Dolphin is a fine-tune on uncensored data with no surgery. JOSIEFIED combines abliteration with system-prompt engineering. Heretic is the most thorough; Dolphin is the most stable; JOSIEFIED is the most VRAM-efficient.

Will Ollama remove uncensored models from its registry?

Ollama has not removed any abliterated or uncensored model as of June 2026, and has publicly committed to a neutral hosting policy. If you’re concerned, mirror the GGUF locally from HuggingFace — the ollama create command accepts a local Modelfile pointing at any GGUF on disk.
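A local mirror needs only a two-line Modelfile. The sketch below generates one; `FROM` and `PARAMETER num_ctx` are standard Modelfile instructions, while the GGUF filename and the helper name are hypothetical placeholders for whatever you downloaded.

```python
def modelfile_text(gguf_path, num_ctx=16384):
    """Minimal Modelfile pointing at a GGUF on disk, with a context-window default."""
    return f"FROM {gguf_path}\nPARAMETER num_ctx {num_ctx}\n"


# Hypothetical filename; substitute the path of the GGUF you mirrored.
text = modelfile_text("./heretic-qwen3-32b-q4_k_m.gguf")
print(text, end="")
```

Save the output as `Modelfile`, then run `ollama create heretic-local -f Modelfile` to register it under a local name of your choosing (`heretic-local` here is just an example).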

Can I fine-tune Heretic-Qwen3 further on my own data?

Yes. The Heretic team publishes the LoRA adapters separately, so you can stack a domain LoRA on top. QLoRA fine-tuning on the 32B fits in 24 GB VRAM with batch size 1 and gradient checkpointing.

What about Mixtral, Grok, or DeepSeek uncensored variants?

Mixtral 8x7B abliterations exist but the architecture is now eclipsed by dense Qwen3 models at similar VRAM. Grok’s open weights (Grok-2 mini) have no popular abliteration as of June 2026. DeepSeek-V3 abliterations are promising but the 671B parameter count puts them outside the local-LLM scope for all but the most extreme rigs.