DeepSeek V4 Self-Hosted: VRAM Budget and Step-by-Step Deployment
A no-fluff guide to running DeepSeek V4-Flash and V4-Pro on your own GPUs in 2026 — real VRAM numbers, vLLM configs, and the break-even point vs the official API.
By Mohamed Meguedmi · 11 min read
Key takeaways
- V4-Flash needs ~170 GB of VRAM in its native FP4+FP8 mixed-precision format (158 GB weights + ~10 GB for the 1M-token KV cache + overhead). It fits on 2×H200, 2×RTX Pro 6000 Blackwell, or 4×A100 80GB.
- V4-Pro (671B active MoE) needs ~1.2 TB of VRAM at FP8. Realistically that's an 8×H200 or 8×B200 node — outside the budget of any individual and most startups.
- Self-hosting only beats the API above ~400M tokens/month on V4-Flash, assuming 70% GPU utilization and $2.49/hr H200 spot pricing.
- vLLM 0.9+ is the only production-ready runtime for V4 today. Ollama works for INT4 single-GPU experiments but loses MoE expert-parallelism gains.
- Skip the 24–48 GB consumer-GPU advice floating around: at that budget you're running Qwen3 or Llama-4-Scout, not DeepSeek V4.
Who should actually self-host DeepSeek V4
DeepSeek V4-Flash launched on April 24, 2026, and V4-Pro followed on May 12. The marketing line is "open weights, run it anywhere." The reality is narrower. Self-hosting V4 is the right call for exactly three groups:
- Privacy-bound enterprises — healthcare, defense, legal — that cannot send tokens to a Chinese-hosted API and have a procurement budget for a multi-GPU node.
- High-volume inference shops processing more than ~400M tokens/month where the per-token economics flip in favor of owning the metal.
- Research labs fine-tuning V4 derivatives or studying its 256-expert MoE routing.
If you're a solo developer with a 24 GB or even 48 GB GPU, stop reading. V4 is not for you. Use the official API at $0.27/M input / $1.10/M output, or pick a model that fits your hardware from our model catalog. We track 180+ open-weight models with real VRAM numbers and use-case verdicts.
VRAM budget: the real numbers
DeepSeek published two V4 variants. They share an architecture (Mixture-of-Experts with sparse routing, MLA attention, 1M-token context) but differ massively in scale.
| Variant | Total params | Active params | Native precision | Weights size | KV cache (1M ctx) | Total VRAM needed |
|---|---|---|---|---|---|---|
| V4-Flash | 284B | 13B | FP4 weights + FP8 attn | 158 GB | ~10 GB | ~170 GB |
| V4-Flash (INT4 GPTQ) | 284B | 13B | INT4 | ~78 GB | ~10 GB | ~95 GB |
| V4-Pro | 1.4T | 78B | FP8 | ~1.1 TB | ~50 GB | ~1.2 TB |
| V4-Pro (FP4) | 1.4T | 78B | FP4 | ~580 GB | ~50 GB | ~640 GB |
Two notes that the Reddit threads keep getting wrong:
- V4 uses MLA (Multi-head Latent Attention) inherited from V3.2, which collapses the KV cache to roughly 7% of standard MHA. That's why a 1M-token context only costs 10 GB instead of the 140+ GB you'd expect on a dense 200B+ model.
- vLLM prefers power-of-two GPU counts for tensor parallelism. You don't need 320 GB to run V4-Flash; you need 170 GB and an even partition. 2×H200 (282 GB) works. 3×A100 80GB (240 GB) fits the weights but vLLM will throw at TP setup. Use 4×A100 or pick a model that bins cleanly.
GPU configurations that actually work in 2026
Below is the shortlist we recommend for V4-Flash in production. V4-Pro is excluded — if you can afford an 8×B200 node, you have a solutions engineer, not a guide.
| Config | Total VRAM | Headroom | Approx. cost (CapEx) | Cloud spot ($/hr) | Throughput (tok/s, batch 32) |
|---|---|---|---|---|---|
| 2× NVIDIA H200 SXM | 282 GB | 112 GB | ~$72,000 | $4.98 | ~3,200 |
| 2× RTX Pro 6000 Blackwell | 192 GB | 22 GB | ~$18,000 | N/A | ~1,900 |
| 4× A100 80GB PCIe | 320 GB | 150 GB | ~$48,000 | $5.20 | ~2,100 |
| 8× L40S | 384 GB | 214 GB | ~$60,000 | $6.40 | ~1,400 |
| 2× MI300X | 384 GB | 214 GB | ~$30,000 | $3.99 | ~2,600* |
* MI300X numbers via vLLM-ROCm 0.9.2. Expert parallelism support is still 6–8 weeks behind CUDA.
Our verdict: for new builds in mid-2026, the 2× RTX Pro 6000 Blackwell is the price/performance sweet spot if you can tolerate 22 GB of headroom (tight for batching beyond 16 concurrent requests). For anyone serving real traffic, 2× H200 is the no-regrets default. Skip 4×A100 unless you already own them — at 2026 used-market prices ($11–13k each), you're paying H200 money for Hopper-era throughput.
Step-by-step: deploying V4-Flash on 2× H200 with vLLM
This is the path we run in our own benchmarks. It assumes Ubuntu 24.04, CUDA 12.6, and the GPUs already visible in nvidia-smi.
1. Install vLLM 0.9.x
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install vllm==0.9.3 huggingface_hub
vLLM 0.9.0 was the first release to ship native DeepSeek-V4 MoE routing kernels. Anything older falls back to a generic MoE path that's 3-4× slower.
2. Pull the weights
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir /models/deepseek-v4-flash \
--local-dir-use-symlinks False
Weights ship in the native FP4+FP8 mixed-precision safetensors format. Expect ~158 GB on disk and ~25 minutes on a 1 Gbps link. See the official DeepSeek-V4-Flash model card for checksums and license terms (MIT).
3. Launch the server
vllm serve /models/deepseek-v4-flash \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--max-num-seqs 64 \
--enable-expert-parallel \
--kv-cache-dtype fp8_e4m3 \
--gpu-memory-utilization 0.92 \
--port 8000
Notes that matter:
--max-model-len 262144— start at 256k tokens. Going to the full 1M is supported but doubles KV reservation and most workloads never hit it.--enable-expert-parallel— distributes the 256 routed experts across both GPUs. Without this, vLLM replicates all experts on every device and you waste ~80 GB.--kv-cache-dtype fp8_e4m3— halves KV memory at a <0.3% quality hit, per the vLLM docs.
4. Verify and bench
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"/models/deepseek-v4-flash","messages":[{"role":"user","content":"Sanity check."}]}'
For real throughput numbers, use vllm bench serve with your actual prompt distribution. Synthetic benchmarks lie — they assume uniform 1024-in/256-out and miss the prefill bottleneck that dominates real RAG traffic.
Quantization: when to go below FP4
V4-Flash ships in native FP4 (weights) + FP8 (attention). Going lower is technically possible but rarely worth it.
| Format | VRAM (weights) | MMLU-Pro | HumanEval+ | Notes |
|---|---|---|---|---|
| FP4+FP8 (native) | 158 GB | 82.1 | 91.4 | Reference. Use this. |
| INT4 GPTQ | 78 GB | 80.8 | 89.7 | Fits on 2× RTX Pro 6000. 1-2% quality drop. |
| INT4 AWQ | 78 GB | 81.0 | 89.9 | Slightly better than GPTQ, vLLM supported. |
| 2-bit (IQ2_XS) | ~46 GB | 72.4 | 78.2 | Not recommended. MoE routing collapses. |
The honest finding: INT4 AWQ is the only sub-native format worth running in production. It unlocks the 2× RTX Pro 6000 Blackwell config with comfortable headroom, costs ~1% on benchmarks, and runs faster than FP4 on Ada/Hopper GPUs because tensor cores are better optimized for INT4 GEMM. Anything below 4 bits breaks expert routing — confirmed by the DeepSeek V4 technical report (May 2026).
API vs self-hosted: the break-even
This is the question that should determine whether you read further or close the tab.
DeepSeek's official V4-Flash API is priced at $0.27 / M input tokens and $1.10 / M output tokens. Take a representative 1:3 input:output mix and the blended rate is roughly $0.89/M tokens.
Self-hosting on 2× H200 spot ($4.98/hr) at 70% utilization delivers ~2,240 tokens/sec sustained, or 5.8B tokens/month. Hardware cost: $3,586/month. Effective rate: $0.62/M tokens.
Subtract a sysadmin, monitoring, K8s overhead, and the curve shifts. Our cost calculator walks the full math for any model/GPU pair — punch in your token volume and it gives you the break-even curve plus a recommendation. For V4-Flash, the rule of thumb in 2026 is:
- < 100M tokens/month: API. Always.
- 100–400M tokens/month: API unless you have a hard privacy requirement.
- 400M–2B tokens/month: Self-host on cloud H200/MI300X spot.
- > 2B tokens/month: Buy the hardware. Payback in ~14 months.
Why we don't recommend Ollama for V4
Ollama added a V4-Flash INT4 build on May 18 (see ollama.com/library/deepseek-v4). It works. It's not what you want for serving.
Ollama's MoE implementation, as of v0.6.4, runs all 256 experts as a dense tensor and uses llama.cpp's generic GGUF path. On a single GPU this is fine — you get the model running with one command. On a multi-GPU node, you lose expert parallelism entirely and throughput drops to ~30% of vLLM. Ollama is the right tool for a developer poking at V4 on a Mac Studio with 256 GB unified memory. It is the wrong tool for serving 1,000 RPS.
For an apples-to-apples comparison with smaller alternatives that do shine on Ollama, see our best local coding LLM 2026 guide.
Operational notes nobody mentions
- NVLink matters more than count. 2× H200 over NVLink beats 4× A100 PCIe by ~40% on token throughput for V4 because MoE all-to-all expert dispatch is bandwidth-bound.
- Cold start is 6-8 minutes from weights-on-disk to first token. Plan your scaling policy around that — V4 is not a model you autoscale to zero.
- Speculative decoding with a Qwen3-1.5B draft model gets you another ~30% latency win on V4-Flash. vLLM 0.9.2+ supports this out of the box.
- Power draw on 2× H200 peaks at 1.4 kW under load. Factor in a 30A PDU if you're racking on-prem.
Programmatic access to our recommendations
We publish all model/hardware compatibility data — VRAM budgets, recommended configs, quantization quality scores — through the free BestLLMfor public API (CC BY 4.0) and the open-source MCP server. If you're building tooling that needs to answer "can this model run on this GPU at what quality," wire it in instead of reinventing the table. Our methodology page documents how the numbers are produced.
Frequently asked questions
How much VRAM do I need to run DeepSeek V4-Flash?
About 170 GB in native FP4+FP8 precision (158 GB weights + ~10 GB for a 1M-token KV cache + overhead). With INT4 AWQ quantization the requirement drops to ~95 GB, which fits on 2× RTX Pro 6000 Blackwell (192 GB total).
Can I run DeepSeek V4-Pro locally?
Technically yes, practically no. V4-Pro needs ~1.2 TB of VRAM at FP8, meaning an 8× H200 or 8× B200 node. That's a $300k+ build. For self-hosting in 2026, V4-Flash is the only realistic target. Use the API for Pro-tier capability.
Is vLLM or Ollama better for DeepSeek V4?
vLLM, by a wide margin, for any production workload. It implements native MoE expert parallelism and FP8 KV cache support. Ollama is fine for single-developer experimentation on Apple Silicon or a single big GPU, but throughput is roughly 30% of vLLM on multi-GPU nodes.
When does self-hosting beat the DeepSeek API on cost?
Around 400M tokens/month for V4-Flash on H200 spot pricing. Below that, the official API at $0.27/M input / $1.10/M output is cheaper than any self-hosted setup once you factor in utilization losses. Above 2B tokens/month, on-prem hardware pays back in ~14 months.
Does INT4 quantization hurt V4-Flash quality?
Marginally. INT4 AWQ costs ~1% on MMLU-Pro and ~1.5% on HumanEval+ vs the native FP4+FP8 build. Going below 4 bits (e.g. IQ2_XS at 2-bit) breaks MoE expert routing and drops MMLU-Pro by ~10 points. Stick to 4-bit or native.
Can I fine-tune DeepSeek V4-Flash on my own data?
LoRA adapters work on a single 8× H200 node using DeepSpeed-Inference + Unsloth's MoE patches (released May 2026). Full fine-tuning requires 32× H100 minimum and is impractical for most teams. The official DeepSeek fine-tuning API is the pragmatic path.
Verdict
DeepSeek V4-Flash is the first frontier-class open-weight model that's genuinely runnable on a single multi-GPU server. The honest path looks like this:
| Your situation | Recommendation |
|---|---|
| Solo dev, < 24 GB GPU | Use the API or pick a smaller model from our catalog. |
| Privacy-bound, modest volume | 2× RTX Pro 6000 Blackwell + INT4 AWQ. ~$18k all-in. |
| High-volume production (400M+ tok/mo) | 2× H200 + native FP4+FP8 on vLLM 0.9.3. Best $/token. |
| Enterprise, V4-Pro required | Stay on the official API. Self-hosting Pro is not viable yet. |
| Research / fine-tuning | 8× H200 node, rented hourly. Don't buy. |
Run the numbers for your token volume in our cost calculator, and if you're picking between V4-Flash and the next tier down (Qwen3-Next-80B, Llama-4-Scout), compare them head-to-head in the catalog. Questions on methodology, the data, or the API are answered on the about page.