Model card

Salamandra 40B Instruct

By BSC · Spain

Updated 2026-07-13

chat general multilingual fr

Parameters

40B

License

Apache 2.0

Context

8,192 tokens

VRAM (Q4)

24 GB

Released

December 2024

Overview

Salamandra 40B Instruct is the 40B-parameter scale-up of BSC's Salamandra family, covering 35 EU languages with native Catalan support. Note that the Hugging Face repo is gated and that a successor, ALIA-40B, is now available.

When to pick this model

EU-sovereign workloads needing 40B-class quality
Romance-language content generation, especially Catalan
Public-sector and regulated deployments in Europe
Multilingual research baselines across 35 EU languages
Workflows already provisioned for ALIA-40B comparisons

VRAM requirements by quantization

Quantization	VRAM required
Q4_K_M (recommended)	24 GB
Q5_K_M	29 GB
Q8_0	43 GB
FP16 (no quantization)	80 GB

VRAM figures include model weights plus a typical 8k KV cache and ~600 MB runtime overhead (Ollama / llama.cpp baseline). Add headroom for higher context lengths.
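As a rough sanity check on the table, weight memory can be approximated as parameter count times bits per weight. The sketch below is illustrative only: the bits-per-weight values are approximate assumptions for the llama.cpp quantization formats, not figures from this page, and the function name is hypothetical. KV cache and runtime overhead come on top of the result.

```python
# Approximate bits per weight for each format.
# These are assumed values for illustration, not taken from this page.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB (10^9 bytes): parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in APPROX_BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weights_gb(40, bits):.1f} GB of weights")
```

For a 40B model these estimates land within about 1 GB of each row above, which makes the formula a quick way to gauge other model sizes.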

In practice, Salamandra 40B Instruct needs a 24 GB card at Q4_K_M. Stepping up to Q8_0 nearly doubles the footprint to 43 GB, and unquantized FP16 takes 80 GB. Plan your GPU around the Q4 or Q5 figure unless you specifically need the higher fidelity.

Without a GPU, Salamandra 40B Instruct needs roughly 40 GB of system RAM to run on CPU via llama.cpp or Ollama. That is workable for background jobs, but far slower than GPU inference. Throughput estimates from our compatibility engine: around 2 tokens/sec on entry-level GPUs, on the order of 10 tokens/sec on a mid-range card, and up to 25 tokens/sec on high-end hardware. These estimates assume the chosen quantization fully fits in VRAM.
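To turn those throughput figures into waiting time, divide the number of output tokens by tokens per second. A minimal sketch, using the three tier estimates quoted above; the function name and the 500-token answer length are illustrative assumptions.

```python
# Throughput estimates (tokens/sec) quoted on this page, by hardware tier.
TOKENS_PER_SEC = {"entry-level": 2, "mid-range": 10, "high-end": 25}


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate output_tokens at a steady decoding rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


if __name__ == "__main__":
    for tier, rate in TOKENS_PER_SEC.items():
        print(f"{tier}: {generation_seconds(500, rate):.0f} s for a 500-token answer")
```

A 500-token answer therefore takes about 250 seconds at 2 tokens/sec, 50 seconds at 10 tokens/sec, and 20 seconds at 25 tokens/sec, before counting prompt processing time.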

What hardware do you need

The table below matches Salamandra 40B Instruct to common GPU memory tiers, using the highest-fidelity quantization that fully fits each card class. Spilling layers to system RAM works but costs most of the speed, so size your card to the quantization you actually want to run.

GPU memory	Example cards	Best fit for Salamandra 40B Instruct
8 GB	RTX 5070 Laptop, RTX 5060, RTX 5060 Ti 8GB	Does not fit — needs 24 GB at Q4_K_M
12 GB	RTX 5070, RTX 5070 Ti Laptop, RTX 4080 Laptop	Does not fit — needs 24 GB at Q4_K_M
16 GB	RTX 5080, RTX 4080 Super, Radeon RX 9070 XT	Does not fit — needs 24 GB at Q4_K_M
24 GB	RTX 4090, Radeon RX 7900 XTX, RTX 5090 Laptop	Q4_K_M (24 GB used)
32 GB	RTX 5090	Q5_K_M (29 GB used)
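The selection rule behind this table (pick the highest-fidelity quantization that fully fits in VRAM) can be sketched as follows. The VRAM figures come from the quantization table on this page; the function name is hypothetical.

```python
from typing import Optional

# VRAM requirements (GB) from the quantization table,
# ordered from lowest to highest fidelity.
VRAM_GB = [("Q4_K_M", 24), ("Q5_K_M", 29), ("Q8_0", 43), ("FP16", 80)]


def best_fit(card_vram_gb: float) -> Optional[str]:
    """Return the highest-fidelity quantization that fits, or None if none does."""
    best = None
    for name, needed_gb in VRAM_GB:
        if needed_gb <= card_vram_gb:
            best = name  # later entries are higher fidelity, so keep overwriting
    return best


if __name__ == "__main__":
    for vram in (8, 12, 16, 24, 32):
        print(f"{vram} GB card: {best_fit(vram) or 'does not fit'}")
```

Note that a 24 GB card running Q4_K_M has no spare VRAM by these figures, so longer contexts may still push layers to system RAM.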

Which GPU should you buy to run Salamandra 40B Instruct?

To run Salamandra 40B Instruct locally at Q4, you need ~24 GB of VRAM. The best value for this is an RTX 4090 (24 GB VRAM).

Strengths

Sovereign European model purpose-built for Romance languages
Unique native Catalan capability among open models
Apache 2.0 license
7.68T tokens with strong Iberian-language coverage

Limitations

Needs ~24 GB of VRAM even at Q4
8192-token context limits modern long-context use
Limited fine-tune ecosystem and gated repo access

Typical workloads

In our catalog grid, Salamandra 40B Instruct is filed under Advanced EU Multilingual, the category where its size/quality trade-off makes the most sense. Its tags map to two concrete workloads: multilingual tasks, and French-language output where quality matters.

Note the 8k-token context window — fine for short interactions, limiting for long documents or big retrieval contexts. The Apache 2.0 license is permissive, so shipping it inside a commercial product raises no special legal questions.
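One common way to work within an 8k window is to split long documents into chunks that leave room for the model's reply. The sketch below is a rough illustration, not a tokenizer: it assumes about 4 characters per token, a heuristic that can be well off for some languages, and the function name and the 1,024-token reserve are hypothetical choices.

```python
CONTEXT_TOKENS = 8192  # context window stated on this page


def split_for_context(text: str, reserve_tokens: int = 1024,
                      chars_per_token: float = 4.0) -> list:
    """Greedily pack whitespace-separated words into chunks under a character budget.

    The budget approximates (context - reserve) tokens using an assumed
    characters-per-token ratio. A single word longer than the budget still
    becomes its own oversized chunk.
    """
    budget_chars = int((CONTEXT_TOKENS - reserve_tokens) * chars_per_token)
    if budget_chars <= 0:
        raise ValueError("reserve_tokens leaves no room for input")

    chunks, current, size = [], [], 0
    for word in text.split():
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > budget_chars:
            chunks.append(" ".join(current))
            current, size = [word], len(word)
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For real workloads, counting tokens with the model's own tokenizer is more reliable than a character heuristic.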

Architecture & training

Architecture: Dense · 40B · BSC MareNostrum · sovereign Romance languages

Training: Barcelona Supercomputing Center — 7.68T tokens, strong in Catalan, Spanish, French, Occitan.

Verdict

The strongest open model for Catalan and Iberian Romance languages — but check ALIA-40B first if you can run either.

Quick start

ollama pull hf.co/BSC-LT/salamandra-40b-instruct-GGUF

Or use the open-source MCP server to query this model from Claude Desktop, Cursor, or any MCP-compatible client.
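Once the model is pulled, it can also be queried from a script through Ollama's local HTTP API. A minimal sketch, assuming Ollama is running on its default port (11434), that you have access to the gated repo, and that the name from the pull command works as the model identifier; the helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "hf.co/BSC-LT/salamandra-40b-instruct-GGUF"  # name from the pull command
OLLAMA_URL = "http://localhost:11434/api/generate"   # assumed default local endpoint


def build_request(prompt: str, num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request, capping context at the model's 8k window."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server with the model pulled):
# print(generate("Escriu una frase curta en català."))
```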

Similar models worth comparing

Model	Parameters	License	VRAM at Q4	Compared with Salamandra 40B Instruct
Apertus 70B	70B	Apache 2.0	40 GB	needs 16 GB more VRAM at Q4
Aya 23 35B	35B	CC-BY-NC 4.0	20 GB	needs 4 GB less VRAM at Q4
Lucie 7B	7B	Apache 2.0	5 GB	needs 19 GB less VRAM at Q4
Salamandra 7B Instruct	7.7B	Apache 2.0	5 GB	same family · needs 19 GB less VRAM at Q4
Seed-OSS 36B Instruct	36B	Apache 2.0	22 GB	needs 2 GB less VRAM at Q4

Frequently asked questions

How much VRAM does Salamandra 40B Instruct need?

At the recommended Q4_K_M quantization, Salamandra 40B Instruct needs about 24 GB of VRAM. Q8_0 takes 43 GB, and unquantized FP16 weights take 80 GB.

Can Salamandra 40B Instruct run without a GPU?

Yes — with roughly 40 GB of system RAM it runs CPU-only through llama.cpp or Ollama. Expect a fraction of GPU speed, which is fine for background or batch jobs but slow for interactive chat.

What context window does Salamandra 40B Instruct support?

Salamandra 40B Instruct supports an 8k-token context window (8,192 tokens).

Can I use Salamandra 40B Instruct commercially?

Yes. Salamandra 40B Instruct is released under Apache 2.0, a permissive open-source license that allows commercial use, modification and redistribution.

How fast is Salamandra 40B Instruct on consumer hardware?

Our compatibility engine estimates on the order of 10 tokens/sec on a mid-range GPU and up to 25 tokens/sec on high-end cards, assuming the quantization fully fits in VRAM.

Which quantization of Salamandra 40B Instruct should I download first?

Start with Q4_K_M (24 GB) — the standard size/quality sweet spot. Step up to Q5_K_M or Q8_0 only if you have VRAM headroom. On a 24 GB card you can run up to Q4_K_M.
