BestLLMfor EN Your hardware. Your LLM. Your call.
APIOpen data Find my LLM
Model fiche

Llama 3.2 Vision 11B

By Meta · United States

vision chat
Parameters
11B
License
Llama 3 Community
Context
128k
VRAM (Q4)
8 GB
Released
September 2024

Overview

Meta's first official multimodal Llama. An 11B vision-language model built on Llama 3.1 8B with added image adapters and a 128k text context.

When to pick this model

  • OCR and document understanding on a consumer GPU
  • Image captioning and description pipelines
  • Chart and graph analysis
  • Mixed text-and-image RAG workloads
  • Llama ecosystem deployments needing vision

VRAM requirements by quantization

QuantizationVRAM required
Q4_K_M (recommended)8 GB
Q5_K_M10 GB
Q8_014 GB
FP16 (no quantization)24 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.

Published benchmark scores

BenchmarkScore
MMMU50.7
DocVQA88.4

Scores published by the model author or aggregated from public leaderboards. Re-measured monthly by our editorial team.

Strengths

  • 128k text context with image input
  • Strong OCR and image description
  • Built on the well-supported Llama 3 base
  • First-party Meta multimodal release

Limitations

  • Vision quality trails Qwen2-VL and LLaVA-OneVision
  • Subject to Llama Community license terms
  • No video understanding
  • Image inputs add significant VRAM overhead

Architecture & training

Architecture: Dense · 11B · vision cross-attention · CLIP encoder · Llama 3.2

Training: Llama 3.1 8B + vision adapters. First official Meta vision model.

Verdict

A solid Llama-family vision model — but Qwen2-VL is the better open-weight choice when license terms allow.

Quick start

ollama run llama3.2-vision:11b

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.

Tools

Is Llama 3.2 Vision 11B the right pick for you?

Compute self-hosted ROI → Back to catalog