Guide · 2026-06-03

DeepSeek V4 Self-Hosted: VRAM Budget and Step-by-Step Deployment

A no-fluff guide to running DeepSeek V4-Flash and V4-Pro on your own GPUs in 2026 — real VRAM numbers, vLLM configs, and the break-even point vs the official API.

By Mohamed Meguedmi · 11 min read

By the BestLLMfor editorial team — last updated June 3, 2026.

Key takeaways

V4-Flash needs ~170 GB of VRAM in its native FP4+FP8 mixed-precision format (158 GB weights + ~10 GB for the 1M-token KV cache + overhead). It fits on 2×H200, 2×RTX Pro 6000 Blackwell, or 4×A100 80GB.
V4-Pro (1.4T total parameters, 78B active) needs ~1.2 TB of VRAM at FP8. Realistically that's an 8×H200 or 8×B200 node, outside the budget of any individual and most startups.
Self-hosting only beats the API above ~400M tokens/month on V4-Flash, assuming 70% GPU utilization and $2.49/hr H200 spot pricing.
vLLM 0.9+ is the only production-ready runtime for V4 today. Ollama works for INT4 single-GPU experiments but loses MoE expert-parallelism gains.
Skip the 24–48 GB consumer-GPU advice floating around: at that budget you're running Qwen3 or Llama-4-Scout, not DeepSeek V4.

Who should actually self-host DeepSeek V4

DeepSeek V4-Flash launched on April 24, 2026, and V4-Pro followed on May 12. The marketing line is "open weights, run it anywhere." The reality is narrower. Self-hosting V4 is the right call for exactly three groups:

Privacy-bound enterprises — healthcare, defense, legal — that cannot send tokens to a Chinese-hosted API and have a procurement budget for a multi-GPU node.
High-volume inference shops processing more than ~400M tokens/month where the per-token economics flip in favor of owning the metal.
Research labs fine-tuning V4 derivatives or studying its 256-expert MoE routing.

If you're a solo developer with a 24 GB or even 48 GB GPU, stop reading. V4 is not for you. Use the official API at $0.27/M input / $1.10/M output, or pick a model that fits your hardware from our model catalog. We track 180+ open-weight models with real VRAM numbers and use-case verdicts.

VRAM budget: the real numbers

DeepSeek published two V4 variants. They share an architecture (Mixture-of-Experts with sparse routing, MLA attention, 1M-token context) but differ massively in scale.

Variant	Total params	Active params	Native precision	Weights size	KV cache (1M ctx)	Total VRAM needed
V4-Flash	284B	13B	FP4 weights + FP8 attn	158 GB	~10 GB	~170 GB
V4-Flash (INT4 GPTQ)	284B	13B	INT4	~78 GB	~10 GB	~95 GB
V4-Pro	1.4T	78B	FP8	~1.1 TB	~50 GB	~1.2 TB
V4-Pro (FP4)	1.4T	78B	FP4	~580 GB	~50 GB	~640 GB

Two notes that the Reddit threads keep getting wrong:

V4 uses MLA (Multi-head Latent Attention) inherited from V3.2, which collapses the KV cache to roughly 7% of standard MHA. That's why a 1M-token context only costs 10 GB instead of the 140+ GB you'd expect on a dense 200B+ model.
vLLM needs a tensor-parallel size that splits the model evenly, which in practice means power-of-two GPU counts. You don't need 320 GB to run V4-Flash; you need 170 GB and an even partition. 2×H200 (282 GB) works. 3×A100 80GB (240 GB) fits the weights, but vLLM will throw an error at TP setup. Use 4×A100 or pick a model that bins cleanly.
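The two constraints in the notes above (enough total VRAM, and a GPU count that tensor parallelism will accept) can be checked mechanically. A minimal Python sketch, assuming this guide's ~170 GB figure and treating the power-of-two rule as a simplification; `check_config` is a hypothetical helper, not part of vLLM:

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (the GPU counts this guide treats as TP-friendly)."""
    return n > 0 and (n & (n - 1)) == 0


def check_config(num_gpus: int, vram_per_gpu_gb: float, required_gb: float = 170.0) -> str:
    """Classify a GPU config for V4-Flash using the two rules from this section."""
    total_gb = num_gpus * vram_per_gpu_gb
    if total_gb < required_gb:
        return "too small"
    if not is_power_of_two(num_gpus):
        return "fits, but bad TP split"
    return "ok"


if __name__ == "__main__":
    # Per-GPU capacities implied by the totals quoted in this guide.
    for name, n, per_gpu in [("2x H200", 2, 141), ("3x A100 80GB", 3, 80), ("4x A100 80GB", 4, 80)]:
        print(f"{name}: {n * per_gpu} GB total -> {check_config(n, per_gpu)}")
```

The check is deliberately coarse: it says nothing about interconnect or batching headroom, which the configuration table below covers.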

GPU configurations that actually work in 2026

Below is the shortlist we recommend for V4-Flash in production. V4-Pro is excluded — if you can afford an 8×B200 node, you have a solutions engineer, not a guide.

Config	Total VRAM	Headroom	Approx. cost (CapEx)	Cloud spot ($/hr)	Throughput (tok/s, batch 32)
2× NVIDIA H200 SXM	282 GB	112 GB	~$72,000	$4.98	~3,200
2× RTX Pro 6000 Blackwell	192 GB	22 GB	~$18,000	N/A	~1,900
4× A100 80GB PCIe	320 GB	150 GB	~$48,000	$5.20	~2,100
8× L40S	384 GB	214 GB	~$60,000	$6.40	~1,400
2× MI300X	384 GB	214 GB	~$30,000	$3.99	~2,600*

* MI300X numbers via vLLM-ROCm 0.9.2. Expert parallelism support is still 6–8 weeks behind CUDA.

Our verdict: for new builds in mid-2026, the 2× RTX Pro 6000 Blackwell is the price/performance sweet spot if you can tolerate 22 GB of headroom (tight for batching beyond 16 concurrent requests). For anyone serving real traffic, 2× H200 is the no-regrets default. Skip 4×A100 unless you already own them: at 2026 used-market prices ($11–13k each), you're paying H200 money for Ampere-era throughput.
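One rough way to sanity-check the verdict is to divide each config's purchase price by its throughput. The sketch below does only that, using the CapEx and tok/s figures from the table above; it ignores headroom, power, and cloud availability, so treat it as a first-pass ranking rather than a buying rule.

```python
# (config, CapEx in USD, throughput in tok/s at batch 32), copied from the table above
CONFIGS = [
    ("2x H200 SXM", 72_000, 3_200),
    ("2x RTX Pro 6000 Blackwell", 18_000, 1_900),
    ("4x A100 80GB PCIe", 48_000, 2_100),
    ("8x L40S", 60_000, 1_400),
    ("2x MI300X", 30_000, 2_600),
]


def capex_per_tok_s(capex_usd: float, tok_s: float) -> float:
    """Purchase price divided by throughput: lower is better."""
    return capex_usd / tok_s


def ranked(configs=CONFIGS):
    """Configs sorted from cheapest to most expensive per tok/s."""
    return sorted(configs, key=lambda c: capex_per_tok_s(c[1], c[2]))


if __name__ == "__main__":
    for name, capex, tok_s in ranked():
        print(f"{name:28s} ${capex_per_tok_s(capex, tok_s):6.2f} per tok/s")
```

On the table's numbers, the RTX Pro 6000 pair comes out cheapest per tok/s, and 4× A100 lands almost level with 2× H200 despite the lower sticker price.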

Step-by-step: deploying V4-Flash on 2× H200 with vLLM

This is the path we run in our own benchmarks. It assumes Ubuntu 24.04, CUDA 12.6, and the GPUs already visible in nvidia-smi.

1. Install vLLM 0.9.x

python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install vllm==0.9.3 huggingface_hub

vLLM 0.9.0 was the first release to ship native DeepSeek-V4 MoE routing kernels. Anything older falls back to a generic MoE path that's 3-4× slower.

2. Pull the weights

huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir /models/deepseek-v4-flash \
  --local-dir-use-symlinks False

Weights ship in the native FP4+FP8 mixed-precision safetensors format. Expect ~158 GB on disk and ~25 minutes on a 1 Gbps link. See the official DeepSeek-V4-Flash model card for checksums and license terms (MIT).
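To plan the download for a different link speed, the arithmetic behind the ~25 minute figure can be sketched as follows. The `efficiency` parameter (fraction of line rate actually achieved) is an assumption for illustration, not a measured value.

```python
def download_minutes(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Minutes to transfer size_gb (decimal GB) over a link_gbps link.

    efficiency is the fraction of line rate actually achieved (assumed, not measured).
    """
    bits = size_gb * 8e9
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 60


if __name__ == "__main__":
    print(f"ideal line rate: {download_minutes(158, 1.0):.1f} min")
    print(f"at 85% of rate:  {download_minutes(158, 1.0, efficiency=0.85):.1f} min")
```

At full line rate 158 GB takes about 21 minutes, so the ~25 minute estimate corresponds to sustaining roughly 85% of a 1 Gbps link.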

3. Launch the server

vllm serve /models/deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --enable-expert-parallel \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.92 \
  --port 8000

Notes that matter:

--max-model-len 262144: start at 256k tokens. Going to the full 1M is supported but quadruples the worst-case KV cache per sequence, and most workloads never hit it.
--enable-expert-parallel: assigns whole experts to each GPU, distributing the 256 routed experts across both devices. Without this flag, vLLM applies tensor parallelism to the MoE layers, splitting every expert across both GPUs instead.
--kv-cache-dtype fp8_e4m3: halves KV memory at a <0.3% quality hit, per the vLLM docs.
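If you launch the server from a script rather than a shell, the same command can be assembled in Python so that the tensor-parallel size and context length become parameters. This is a sketch that only rebuilds the flags shown in step 3; `build_vllm_command` is a hypothetical helper, and nothing here is a vLLM API.

```python
import shlex


def build_vllm_command(
    model_path: str = "/models/deepseek-v4-flash",
    tensor_parallel_size: int = 2,
    max_model_len: int = 262_144,
    max_num_seqs: int = 64,
    gpu_memory_utilization: float = 0.92,
    port: int = 8000,
) -> list[str]:
    """Return the argv for the `vllm serve` invocation shown in step 3."""
    return [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--enable-expert-parallel",
        "--kv-cache-dtype", "fp8_e4m3",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run() to launch it.
    print(shlex.join(build_vllm_command()))
```

Keeping the command as a list avoids shell-quoting problems if the model path ever contains spaces.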

4. Verify and bench

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/models/deepseek-v4-flash","messages":[{"role":"user","content":"Sanity check."}]}'
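The same sanity check can be run from Python using only the standard library. This is a minimal sketch: it assumes the server from step 3 is listening on localhost:8000 and that the response follows the OpenAI chat-completions shape (`choices[0].message.content`); `build_payload` and `chat` are illustrative names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"       # port from the launch command in step 3
MODEL = "/models/deepseek-v4-flash"      # same model name the curl request uses


def build_payload(prompt: str, model: str = MODEL) -> dict:
    """Build the same JSON body the curl command sends."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(prompt: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """POST to the chat-completions endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (needs the server running): print(chat("Sanity check."))
```

A connection error here usually means the server is still loading weights; see the cold-start note further down.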

For real throughput numbers, use vllm bench serve with your actual prompt distribution. Synthetic benchmarks lie — they assume uniform 1024-in/256-out and miss the prefill bottleneck that dominates real RAG traffic.

Quantization: when to go below FP4

V4-Flash ships in native FP4 (weights) + FP8 (attention). Going lower is technically possible but rarely worth it.

Format	VRAM (weights)	MMLU-Pro	HumanEval+	Notes
FP4+FP8 (native)	158 GB	82.1	91.4	Reference. Use this.
INT4 GPTQ	78 GB	80.8	89.7	Fits on 2× RTX Pro 6000. 1-2% quality drop.
INT4 AWQ	78 GB	81.0	89.9	Slightly better than GPTQ, vLLM supported.
2-bit (IQ2_XS)	~46 GB	72.4	78.2	Not recommended. MoE routing collapses.

The honest finding: INT4 AWQ is the only sub-native format worth running in production. It unlocks the 2× RTX Pro 6000 Blackwell config with comfortable headroom, costs ~1 point on benchmarks, and runs faster than FP4 on Ada/Hopper GPUs, which lack native FP4 tensor-core support (that arrived with Blackwell). Anything below 4 bits breaks expert routing, as confirmed by the DeepSeek V4 technical report (May 2026).
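The quality cost of each format is easier to compare as points lost against the native build. A small sketch using only the scores from the table above:

```python
# Scores copied from the table above: format -> (MMLU-Pro, HumanEval+)
SCORES = {
    "FP4+FP8 (native)": (82.1, 91.4),
    "INT4 GPTQ": (80.8, 89.7),
    "INT4 AWQ": (81.0, 89.9),
    "2-bit (IQ2_XS)": (72.4, 78.2),
}
BASELINE = "FP4+FP8 (native)"


def point_drop(fmt: str) -> tuple[float, float]:
    """Points lost vs the native build on (MMLU-Pro, HumanEval+), rounded to 0.1."""
    base_mmlu, base_he = SCORES[BASELINE]
    mmlu, he = SCORES[fmt]
    return round(base_mmlu - mmlu, 1), round(base_he - he, 1)


if __name__ == "__main__":
    for fmt in SCORES:
        if fmt != BASELINE:
            mmlu_drop, he_drop = point_drop(fmt)
            print(f"{fmt:16s} MMLU-Pro -{mmlu_drop}  HumanEval+ -{he_drop}")
```

Both 4-bit formats stay within two points of native on each benchmark, while the 2-bit build loses roughly ten points or more.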

API vs self-hosted: the break-even

This is the question that should determine whether you read further or close the tab.

DeepSeek's official V4-Flash API is priced at $0.27 / M input tokens and $1.10 / M output tokens. Take a representative 1:3 input:output mix and the blended rate is roughly $0.89/M tokens.

Self-hosting on 2× H200 spot ($4.98/hr) at 70% utilization delivers ~2,240 tokens/sec sustained, or 5.8B tokens/month. Hardware cost: $3,586/month. Effective rate: $0.62/M tokens.
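The arithmetic in the two paragraphs above can be reproduced with a few lines of Python, which also makes it easy to swap in your own prices and mix. The sketch assumes a 30-day month and a node that is rented around the clock; the function names are illustrative.

```python
def blended_api_rate(input_price: float, output_price: float,
                     input_parts: float, output_parts: float) -> float:
    """Weighted $/M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_price * input_parts + output_price * output_parts) / total_parts


def self_host_rate(hourly_usd: float, peak_tok_s: float, utilization: float,
                   hours_per_month: float = 24 * 30) -> tuple[float, float, float]:
    """Return (monthly cost in USD, tokens per month, effective $/M tokens)."""
    monthly_cost = hourly_usd * hours_per_month
    tokens = peak_tok_s * utilization * 3600 * hours_per_month
    return monthly_cost, tokens, monthly_cost / (tokens / 1e6)


if __name__ == "__main__":
    api = blended_api_rate(0.27, 1.10, input_parts=1, output_parts=3)
    cost, tokens, rate = self_host_rate(4.98, 3_200, 0.70)
    print(f"API blended: ${api:.2f}/M")
    print(f"Self-hosted: ${cost:,.0f}/month, {tokens / 1e9:.1f}B tokens, ${rate:.2f}/M")
```

Note that the self-hosted rate only holds if the node actually serves its full sustained capacity; at lower volume the monthly cost stays fixed while the per-token rate climbs.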

Factor in a sysadmin, monitoring, and K8s overhead, and the curve shifts back toward the API. Our cost calculator walks the full math for any model/GPU pair: punch in your token volume and it gives you the break-even curve plus a recommendation. For V4-Flash, the rule of thumb in 2026 is:

< 100M tokens/month: API. Always.
100–400M tokens/month: API unless you have a hard privacy requirement.
400M–2B tokens/month: Self-host on cloud H200/MI300X spot.
> 2B tokens/month: Buy the hardware. Payback in ~14 months.
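The tiers above translate directly into a lookup. A minimal sketch; how the exact boundary values (100M, 400M, 2B) are assigned is an assumption here, since the list does not say which side they fall on:

```python
def recommend(tokens_per_month: float, hard_privacy_requirement: bool = False) -> str:
    """Map monthly token volume to this guide's 2026 rule of thumb for V4-Flash."""
    million, billion = 1e6, 1e9
    if tokens_per_month < 100 * million:
        return "API"
    if tokens_per_month < 400 * million:
        # The only tier where a hard privacy requirement changes the answer.
        return "self-host" if hard_privacy_requirement else "API"
    if tokens_per_month <= 2 * billion:
        return "self-host on cloud H200/MI300X spot"
    return "buy the hardware"


if __name__ == "__main__":
    for volume in (50e6, 200e6, 1e9, 3e9):
        print(f"{volume / 1e6:>6.0f}M tokens/month -> {recommend(volume)}")
```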

Why we don't recommend Ollama for V4

Ollama added a V4-Flash INT4 build on May 18 (see ollama.com/library/deepseek-v4). It works. It's not what you want for serving.

Ollama's MoE implementation, as of v0.6.4, runs all 256 experts as a dense tensor and uses llama.cpp's generic GGUF path. On a single GPU this is fine — you get the model running with one command. On a multi-GPU node, you lose expert parallelism entirely and throughput drops to ~30% of vLLM. Ollama is the right tool for a developer poking at V4 on a Mac Studio with 256 GB unified memory. It is the wrong tool for serving 1,000 RPS.

For an apples-to-apples comparison with smaller alternatives that do shine on Ollama, see our best local coding LLM 2026 guide.

Operational notes nobody mentions

NVLink matters more than count. 2× H200 over NVLink beats 4× A100 PCIe by ~50% on token throughput for V4 (~3,200 vs ~2,100 tok/s at batch 32 in the table above) because MoE all-to-all expert dispatch is bandwidth-bound.
Cold start is 6-8 minutes from weights-on-disk to first token. Plan your scaling policy around that — V4 is not a model you autoscale to zero.
Speculative decoding with a Qwen3-1.5B draft model gets you another ~30% latency win on V4-Flash. vLLM 0.9.2+ supports this out of the box.
Power draw on 2× H200 peaks at 1.4 kW under load. Factor in a 30A PDU if you're racking on-prem.

Programmatic access to our recommendations

We publish all model/hardware compatibility data — VRAM budgets, recommended configs, quantization quality scores — through the free BestLLMfor public API (CC BY 4.0) and the open-source MCP server. If you're building tooling that needs to answer "can this model run on this GPU at what quality," wire it in instead of reinventing the table. Our methodology page documents how the numbers are produced.

Frequently asked questions

How much VRAM do I need to run DeepSeek V4-Flash?

About 170 GB in native FP4+FP8 precision (158 GB weights + ~10 GB for a 1M-token KV cache + overhead). With INT4 AWQ quantization the requirement drops to ~95 GB, which fits on 2× RTX Pro 6000 Blackwell (192 GB total).

Can I run DeepSeek V4-Pro locally?

Technically yes, practically no. V4-Pro needs ~1.2 TB of VRAM at FP8, meaning an 8× H200 or 8× B200 node. That's a $300k+ build. For self-hosting in 2026, V4-Flash is the only realistic target. Use the API for Pro-tier capability.

Is vLLM or Ollama better for DeepSeek V4?

vLLM, by a wide margin, for any production workload. It implements native MoE expert parallelism and FP8 KV cache support. Ollama is fine for single-developer experimentation on Apple Silicon or a single big GPU, but throughput is roughly 30% of vLLM on multi-GPU nodes.

When does self-hosting beat the DeepSeek API on cost?

Around 400M tokens/month for V4-Flash on H200 spot pricing. Below that, the official API at $0.27/M input / $1.10/M output is cheaper than any self-hosted setup once you factor in utilization losses. Above 2B tokens/month, on-prem hardware pays back in ~14 months.

Does INT4 quantization hurt V4-Flash quality?

Marginally. INT4 AWQ costs ~1 point on MMLU-Pro and ~1.5 points on HumanEval+ vs the native FP4+FP8 build. Going below 4 bits (e.g. IQ2_XS at 2-bit) breaks MoE expert routing and drops MMLU-Pro by ~10 points. Stick to 4-bit or native.

Can I fine-tune DeepSeek V4-Flash on my own data?

LoRA adapters work on a single 8× H200 node using DeepSpeed-Inference + Unsloth's MoE patches (released May 2026). Full fine-tuning requires 32× H100 minimum and is impractical for most teams. The official DeepSeek fine-tuning API is the pragmatic path.

Verdict

DeepSeek V4-Flash is the first frontier-class open-weight model that's genuinely runnable on a single multi-GPU server. The honest path looks like this:

Your situation	Recommendation
Solo dev, single 24–48 GB GPU	Use the API or pick a smaller model from our catalog.
Privacy-bound, modest volume	2× RTX Pro 6000 Blackwell + INT4 AWQ. ~$18k all-in.
High-volume production (400M+ tok/mo)	2× H200 + native FP4+FP8 on vLLM 0.9.3. Best $/token.
Enterprise, V4-Pro required	Stay on the official API. Self-hosting Pro is not viable yet.
Research / fine-tuning	8× H200 node, rented hourly. Don't buy.

Run the numbers for your token volume in our cost calculator, and if you're picking between V4-Flash and the next tier down (Qwen3-Next-80B, Llama-4-Scout), compare them head-to-head in the catalog. Questions on methodology, the data, or the API are answered on the about page.
