Editorial ranking · 2026

Best local LLM with vision

Q: What is the best local LLM for vision and multimodal tasks?

Qwen 3 VL 30B-A3B tops this ranking: a 30B mixture-of-experts model with 3B active parameters, licensed under Apache 2.0, needing about 19 GB of VRAM at Q4 quantization. See the full list below for the runners-up and how they compare.

Last updated 2026-05-26 · Page updated 2026-07-13

Top 7 open-weight picks for vision and multimodal tasks, ranked by benchmark performance and real-world fit. Updated monthly.

Qwen 3 VL 30B-A3B

30B · Alibaba · Apache 2.0

The sweet spot of the Qwen 3 VL family: a 30B MoE with 3B active parameters and a 256k context. It delivers most of the quality of the family's 235B flagship at a fraction of the hardware cost.

VRAM Q4: 19 GB · Context: 256k

Nemotron Nano v2 VL 12B

12.6B · NVIDIA · NVIDIA Open Model License

NVIDIA's 12.6B enterprise VLM with strong DocVQA and ChartQA scores, tuned for professional document extraction workflows.

VRAM Q4: 8 GB · Context: 125k

Qwen 2.5 VL 7B

7B · Alibaba · Apache 2.0

A 7B vision-language model from Alibaba with state-of-the-art results in its class, scoring 95.7 on DocVQA. Handles hour-long video, bounding-box grounding, and multilingual OCR.

VRAM Q4: 6 GB · Context: 125k

Qwen 3 VL 8B

8B · Alibaba · Apache 2.0

The dense 8B entry in Qwen 3 VL, offering strong OCR and document analysis with a remarkable 256k multimodal context for its size.

VRAM Q4: 6 GB · Context: 256k

Qwen 3 Omni 30B-A3B

30B · Alibaba · Apache 2.0

Alibaba's omni-modal 30B MoE (3B active) with streaming speech output, text support for 119 languages, and speech input in 19 languages. It handles text, image, audio, and video in a single Apache 2.0 model.

VRAM Q4: 19 GB · Context: 128k

LLaDA 2.0 Uni 16B

16B · Ant Group / inclusionAI · Apache 2.0

Ant Group's first open diffusion LLM under Apache 2.0: a 16B MoE with about 1B active parameters, paired with a 6.2B diffusion decoder, unifying text and vision generation and editing. Released April 2026.

VRAM Q4: 18 GB · Context: 8k

Mistral Small 3.1 24B

24B · Mistral AI · Apache 2.0

Mistral Small 3.1 adds a vision encoder and a 128k context window to Small 3, with a vendor-quoted inference speed of about 150 tok/s, under Apache 2.0. Small 3.2 (June 2025) is a drop-in upgrade.

VRAM Q4: 14 GB · Context: 128k

Which GPU should you buy to run Qwen 3 VL 30B-A3B?

To run Qwen 3 VL 30B-A3B locally at Q4, you need about 19 GB of VRAM. Our best-value pick is an RTX 4090 (24 GB VRAM), which leaves roughly 5 GB of headroom for context.

As an Amazon Associate, BestLLMfor earns from qualifying purchases, at no extra cost to you. This does not influence our independent rankings.

Frequently asked questions

How much VRAM do I need to run Qwen 3 VL 30B-A3B?

At Q4 quantization, Qwen 3 VL 30B-A3B needs about 19 GB of VRAM and fits comfortably on a single 24 GB GPU.
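As a rough sanity check on the 19 GB figure, the sketch below estimates weight memory from the parameter count and bits per weight, then adds a flat allowance for runtime overhead. The 4.5 bits per weight and the 2 GB overhead are illustrative assumptions, not the method used for this ranking; actual Q4 formats and runtimes vary.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,   # assumed average for a Q4 format
                     overhead_gb: float = 2.0) -> float:  # assumed runtime allowance
    """Rough VRAM estimate in GB, taking 1 GB as 1e9 bytes."""
    # Billions of parameters times bytes per parameter gives GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


if __name__ == "__main__":
    # All 30B parameters of the MoE must be resident, not just the 3B active ones.
    print(f"{estimate_vram_gb(30):.1f} GB")  # prints 18.9 GB
```

Under these assumptions the estimate lands close to the listed 19 GB. Long contexts add KV-cache memory on top, which this sketch does not model.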

Which of these models fit an 8 GB GPU?

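Three models on this list fit within 8 GB of VRAM at Q4 quantization: Nemotron Nano v2 VL 12B (8 GB), Qwen 2.5 VL 7B (6 GB), and Qwen 3 VL 8B (6 GB). Nemotron sits right at the 8 GB limit, so it leaves little headroom for long contexts.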
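The same check can be run for any VRAM budget. This is a minimal sketch using the Q4 figures from the list above; the `MODELS` table and the `fits` helper are hypothetical names for illustration, and the optional headroom parameter is an assumption about how much room you want to reserve for context.

```python
# Q4 VRAM figures in GB, as listed on this page, in ranking order.
MODELS = {
    "Qwen 3 VL 30B-A3B": 19,
    "Nemotron Nano v2 VL 12B": 8,
    "Qwen 2.5 VL 7B": 6,
    "Qwen 3 VL 8B": 6,
    "Qwen 3 Omni 30B-A3B": 19,
    "LLaDA 2.0 Uni 16B": 18,
    "Mistral Small 3.1 24B": 14,
}


def fits(budget_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """Return models whose listed Q4 VRAM plus headroom fits the budget."""
    return [name for name, vram in MODELS.items()
            if vram + headroom_gb <= budget_gb]


if __name__ == "__main__":
    print(fits(8))                  # the three models that fit an 8 GB card
    print(fits(8, headroom_gb=1))   # drops Nemotron, which sits at the limit
    print(fits(24))                 # every model on the list
```

Reserving even 1 GB of headroom on an 8 GB card leaves only the two Qwen models, which is worth weighing if you plan to use long contexts.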

Are the models on this list free for commercial use?

Six of the seven models are released under Apache 2.0; Nemotron Nano v2 VL 12B uses the NVIDIA Open Model License. Check the specific license of each model on its catalog page before deploying commercially, as terms vary by author.

What context window do these models support?

Context windows on this list range from 8k tokens (LLaDA 2.0 Uni 16B) to 256k tokens (Qwen 3 VL 30B-A3B and Qwen 3 VL 8B).