TL;DR — three things to take away:

- Architecture: DSA sparse attention makes 1M-token context economically usable, not just technically possible.
- Performance: V4-Pro pulls open source to parity with closed-source flagships on coding, under an MIT license.
- Cost: structural pricing plus Ascend integration shifts the cost curve of LLM infrastructure.

1. Context: where V4 sits in DeepSeek's 2026 roadmap

On 2026-04-24, DeepSeek released the V4 series in preview, comprising DeepSeek-V4-Pro and DeepSeek-V4-Flash. The weights are on Hugging Face under the MIT License — fully commercial-use friendly. This is DeepSeek's first major architecture jump since V3 shook the industry in early 2025: V3 → V3.1 → V3.2 were iterations within the same family, while V4 introduces a foundational change — the DSA sparse attention mechanism.

The timing is deliberate. The Chinese open-source LLM space has been crowded over the past 30 days — Alibaba's Qwen 3.5, Z.ai's GLM-5.1, and Moonshot's new Kimi all shipped recently. By leading V4 with hard numbers like LiveCodeBench 93.5 and Codeforces 3206, DeepSeek's intent is clear: retake the "best open-source coding model" badge from the Qwen-Coder line, while pushing price an order of magnitude below comparable closed-source offerings.

Equally important: Huawei announced same-day full inference support for the V4 series, and is committing to mass-produced Ascend 950 chips that will further drive V4-Pro pricing down. In other words, V4 is not just a model upgrade; it's the first industrial-scale demonstration of a Chinese AI stack — open weights, domestic accelerators, low-cost API — actually closing the loop.

2. Specs at a glance

| Spec | Value | Notes |
| --- | --- | --- |
| V4-Pro total / active params | 1.6T / 49B | MoE, 865 GB weights |
| V4-Flash total / active params | 284B / 13B | 160 GB weights, single-node serveable |
| Context window | 1,000,000 tokens | Both variants; enabled by DSA + token-wise compression |
| Pre-training tokens | 33T / 32T (Pro / Flash) | Comparable scale |
| License | MIT | Commercial use, fine-tuning, redistribution |
| Training compute | ~16,000 Hopper GPUs | Total cost ≈ $5.6M, ~2× efficiency over V3 |

3. The architectural innovation: what is DSA Sparse Attention?

In the V4 technical report, DeepSeek positions DSA (DeepSeek Sparse Attention) as the headline architectural change. It addresses one specific problem: at 1M-token context, traditional full-attention KV cache and per-step FLOPs both scale roughly quadratically — making 1M context "technically possible, economically unusable".

DSA combines two mechanisms:
  1. Sparse attention computation: each token attends only to a dynamically selected "salient subset" of other tokens, not the full context. The subset is chosen by a learned, content-driven mechanism, rather than by a fixed window (as in local attention) or a routing classifier (as in MoE).
  2. Token-wise compression: at the KV-cache layer, historical tokens are compressed adaptively based on content, so older tokens occupy less memory while preserving key semantics.
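The content-driven selection in mechanism 1 can be illustrated with a toy, single-query sketch. This is NumPy pseudocode for the general idea of learned top-k attention; the scoring rule, subset size, and shapes are illustrative assumptions, not DeepSeek's actual DSA:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """Toy single-query sparse attention: the query attends only to its
    k highest-scoring keys (a content-driven subset), not the full context."""
    scores = K @ q / np.sqrt(q.shape[-1])      # similarity to every key, shape (T,)
    idx = np.argpartition(scores, -k)[-k:]     # indices of the k most salient tokens
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                               # softmax over the subset only
    return w @ V[idx]                          # weighted value readout, shape (d,)

rng = np.random.default_rng(0)
T, d = 1024, 64
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
q = rng.normal(size=d)
out = topk_sparse_attention(q, K, V, k=32)
print(out.shape)  # (64,)
```

With k fixed, per-query cost of the softmax/readout stays O(k·d) regardless of how long the context grows; in DSA the selection itself is learned end-to-end during training rather than hand-coded like this argpartition.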

Combined effect: V4-Pro uses 27% of V3.2's per-token FLOPs and 10% of its KV cache; V4-Flash is even more aggressive at 10% FLOPs and 7% memory. On the same hardware, V4 can serve roughly an order of magnitude more concurrent requests.
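To see why a ~10× KV-cache cut matters at 1M tokens, here is a back-of-envelope sizing sketch. The layer/head/dimension numbers below are hypothetical, since DeepSeek has not published V4's exact attention configuration:

```python
# Hypothetical attention config, for illustration only.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_val = 2  # fp16/bf16

def kv_cache_gb(tokens, keep_fraction=1.0):
    """GB needed for K and V across all layers, scaled by the fraction
    of cache retained after token-wise compression."""
    full = 2 * layers * kv_heads * head_dim * bytes_per_val * tokens
    return full * keep_fraction / 1e9

full = kv_cache_gb(1_000_000)          # dense attention at 1M context
pro = kv_cache_gb(1_000_000, 0.10)     # V4-Pro: ~10% of V3.2's cache
flash = kv_cache_gb(1_000_000, 0.07)   # V4-Flash: ~7%
print(f"dense: {full:.0f} GB, Pro: {pro:.0f} GB, Flash: {flash:.1f} GB")
```

Under these assumed dimensions, a single 1M-token conversation would eat hundreds of GB of cache with dense attention, but tens of GB under DSA, which is where the order-of-magnitude concurrency gain comes from.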

Conceptually DSA shares ground with Mistral's Sliding Window Attention, the long-context moves Anthropic and others have made for Claude, and the state-space approach in Mamba — all trade full-connectivity attention for long-context practicality. What's distinctive about DeepSeek's version is how thoroughly the sparse assumption is baked into the entire pipeline: the model is trained, not just inference-patched, under the sparse regime. That's why V4 ships 1M context as a default capability rather than a special configuration.

4. Benchmarks: V4-Pro vs GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1-Pro

Below is V4-Pro (in its highest-reasoning "Pro-Max" mode) compared head-to-head with the three closed-source frontier models. Numbers are from DeepSeek's official technical report, cross-checked against published third-party results; gaps are scores that were not reported at release.

| Benchmark | DeepSeek V4-Pro | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1-Pro |
| --- | --- | --- | --- | --- |
| LiveCodeBench | 93.5 | 88.8 | 91.7 | — |
| Codeforces Rating | 3,206 | 3,168 | — | — |
| SWE-bench Verified | 80.6 | 80.8 | 80.6 | — |
| MMLU-Pro | 87.5 | 89.1 | 87.5 | 91.0 |
| GPQA Diamond | 90.1 | 91.3 | 93.0 | 94.3 |
| Terminal-Bench 2.0 | 67.9 | 65.4 | 75.1 | 68.5 |
| HLE (Humanity's Last Exam) | 37.7 | — | — | — |

How to read this

The conclusion is clean: V4-Pro is the open-source frontier, but not yet the absolute frontier. If your workload is centered on coding, algorithms, or long-document handling, V4-Pro is performance-equivalent and an order of magnitude cheaper. If you need top-tier scientific reasoning, closed-source flagships still have the edge — but the gap is narrowing.

5. Pricing: a structural cost shift

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Context | License |
| --- | --- | --- | --- | --- |
| DeepSeek V4-Pro | 1.74 | 3.48 | 1M | MIT (open) |
| DeepSeek V4-Flash | 0.14 | 0.28 | 1M | MIT (open) |
| OpenAI GPT-5.4 | ~10 | ~30 | 200K | Closed |
| Anthropic Claude Opus 4.6 | ~7.5 | ~25 | 500K | Closed |
| Moonshot Kimi (latest) | ~1.5 | ~4 | 200K | Closed API |
| Alibaba Qwen-3.5-Max | ~0.8 | ~2.4 | 1M | Mixed (mid-size open) |

V4-Pro's output price is roughly 9× cheaper than OpenAI and 7× cheaper than Claude. V4-Flash output at $0.28/M is approximately 1/100 of OpenAI's. Even within the Chinese open-source camp, V4-Flash undercuts Qwen-3.5-Max by 8–9×.
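A quick way to feel these ratios is to price a hypothetical workload. The token volumes below are invented for illustration; the per-million prices come from the table above:

```python
# $/1M tokens (input, output), from the pricing table above.
prices = {
    "DeepSeek V4-Pro":   (1.74, 3.48),
    "DeepSeek V4-Flash": (0.14, 0.28),
    "GPT-5.4":           (10.0, 30.0),
    "Claude Opus 4.6":   (7.5, 25.0),
}
# Hypothetical monthly volume: 2B input + 500M output tokens (in millions).
in_m, out_m = 2_000, 500

monthly = {name: in_m * p_in + out_m * p_out
           for name, (p_in, p_out) in prices.items()}
for name, cost in monthly.items():
    print(f"{name}: ${cost:,.0f}/month")
```

At this volume the same workload runs to roughly $5.2K on V4-Pro and a few hundred dollars on V4-Flash, versus tens of thousands on the closed flagships.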

This isn't a promotional price. Combined with the architecture changes above, V4's pricing is structural — MoE activation efficiency + DSA-driven KV reduction + Ascend 950 inference optimization stack into a real per-token cost reduction. Even if OpenAI or Anthropic respond with price cuts, matching this floor is hard.

6. Selection guide: when to use V4 — and when not to

V4-Pro is the natural choice when…

Large-codebase refactors, full-repo reviews, multi-file edits: 1M context plus top-tier LiveCodeBench/SWE-bench performance — V4-Pro is the most cost-effective "global code understanding" model available.

Long-document / PDF / contract analysis: 1M context fits an entire book or dozens of contracts; tokens cost about 1/7 of Claude.

Cost-sensitive Agent orchestration at scale: if your Agent system burns thousands of dollars in tokens daily, V4-Pro can shave roughly 80% off your LLM bill.

V4-Flash is the natural choice when…

Large-scale online Q&A / customer support: $0.28/M output plus 1M context fits an entire knowledge base and chat history.

Batch pipelines (data cleaning, summarization, translation): at Flash's price you can run full-scale jobs at high concurrency without flinching.

Local deployment (160 GB weights): a single 8×A100 or 8×H100 node serves it; once Unsloth's quantized builds drop, even 4×H100 or smaller will work.

Where V4 is NOT the right pick:

Frontier scientific reasoning: on GPQA Diamond and MMLU-Pro, Gemini 3.1-Pro and GPT-5.4 still hold a clear lead (see the benchmark table in Section 4).

Multimodal workloads: V4 is a text-only release; image or audio inputs need a different model until a multimodal V4.5/V5 arrives (see Section 8).

7. Three ways to access V4

A. DeepSeek's official API (most direct)

It's OpenAI-compatible — just swap the base_url:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_KEY",
    base_url="https://api.deepseek.com/v1",
)
resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # or "deepseek-v4-flash"
    messages=[{"role": "user", "content": "Explain DSA sparse attention."}],
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```

B. OpenRouter (multi-provider routing)

If you already use OpenRouter for multi-model fallback, the model IDs are deepseek/deepseek-v4-pro and deepseek/deepseek-v4-flash. There's a 5–10% routing markup, but you get one API key with seamless fallback to Claude or GPT-5 when needed.

C. Self-hosting via Hugging Face + vLLM

For teams already self-hosting V3/V3.2 or with strict data-residency requirements: V4-Flash (160 GB) runs on a single 8×H100 box; V4-Pro (865 GB) needs a 4–8 node cluster with high-bandwidth interconnect. Unsloth-quantized GGUF builds are imminent and will further lower the hardware bar.
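For a single-node Flash deployment, the serving command could look something like this with vLLM's OpenAI-compatible server. The Hugging Face repo name here is a guess; check the actual model card before using it:

```shell
# Hypothetical repo ID -- verify against the official Hugging Face model card.
# Shards V4-Flash across 8 GPUs on one node and exposes an OpenAI-compatible API.
vllm serve deepseek-ai/DeepSeek-V4-Flash \
    --tensor-parallel-size 8 \
    --max-model-len 1000000
```

The `--max-model-len` cap can be lowered if you don't need the full 1M context, which proportionally shrinks the KV-cache memory vLLM reserves per request.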

8. How V4 fits into this week's bigger trends

In our weekly highlights this week we identified three trends: multimodal generation entering industrial deployment, latent reasoning challenging explicit CoT, and Agent training becoming systematic. DeepSeek V4 strongly validates the latter two.

The remaining trend, multimodal generation, doesn't show up in V4: this is a text-only release. A multimodal version is expected in V4.5 or V5. This matches DeepSeek's standing strategy: each generation drives one breakthrough through to industrial scale rather than shipping a full multimodal stack at once.

9. The under-appreciated signal: open source has crossed the cost line

For the past three years, the open-source LLM story has been "performance gradually catching closed-source." V4 changes the script — it delivers ~80% of frontier performance plus a step-change in cost.

Take LiveCodeBench (probably the most engineering-relevant code benchmark) as a quick napkin calculation: V4-Pro (93.5 score, $3.48/M output) is ~27 score-points per dollar; Claude Opus 4.6 (88.8 score, $25/M output) is ~3.6 score-points per dollar. At equal budgets, V4-Pro produces about 7.5× the high-quality code-token throughput of Claude.
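The napkin math above, spelled out with the scores and prices from this article (score-per-dollar is of course a crude metric, since benchmark points are not linearly comparable):

```python
# LiveCodeBench score and output price ($/1M tokens) from Sections 4 and 5.
models = {
    "DeepSeek V4-Pro": {"livecodebench": 93.5, "out_price": 3.48},
    "Claude Opus 4.6": {"livecodebench": 88.8, "out_price": 25.0},
}

# Score-points per dollar of output tokens.
ratios = {name: m["livecodebench"] / m["out_price"] for name, m in models.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1f} score-points per $")

# At equal budgets, relative high-quality code-token throughput.
advantage = ratios["DeepSeek V4-Pro"] / ratios["Claude Opus 4.6"]
print(f"budget-equivalent throughput advantage: ~{advantage:.2f}x")
```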

When an open-source model holds parity in the main battleground (coding) at 7–9× lower cost, the closed-source moat reduces to two things: (1) knowledge depth (where Gemini still leads on GPQA / MMLU); and (2) tooling and ecosystem maturity (function-calling SDKs, structured-output validation, enterprise SLAs). The first will erode with time; the second is exactly where Anthropic and OpenAI need to spend their next year.

10. What to watch over the next 30 days

V4 is a release that gets three things right at once: architecture-wise, DSA makes 1M context truly usable; performance-wise, it pulls open source to parity with Claude Opus 4.6 / Gemini 3.1-Pro; commercially, structural pricing plus Ascend integration shifts the cost curve of LLM infrastructure.

Four follow-ups worth tracking:

  1. Independent benchmark replications: nearly all of today's numbers are official; leaderboards such as LMSys, Aider, and Vellum will confirm or contradict them.
  2. Unsloth quantization and local-deployment ecosystem: if V4-Flash quantized cleanly to a single H100, expect a wave of "self-hosted flagship" use cases in the open-source community.
  3. OpenAI / Anthropic price responses: how aggressively GPT-5.4 and Claude Opus respond will determine how long V4's price advantage holds.
  4. Ascend 950 mass-production cadence: DeepSeek has signaled further price cuts contingent on Ascend 950 ramp. This is the milestone for whether a Chinese AI stack can truly de-risk from Nvidia.

Want to track new AI models and trending papers daily? Paper Collector pulls Hugging Face papers and open-source projects every day with summaries.


Sources: DeepSeek's official V4 technical report and release notes, Hugging Face model cards, Simon Willison's hands-on notes, Bloomberg, CNBC, Fortune, MIT Technology Review, Macaron benchmark compilation. Written 2026-04-25; some independent benchmarks are still being replicated. Pricing and availability are per the official release. Trend judgments reflect the team's view; for reference only.