TL;DR — three things to take away:

- Architecture: DSA sparse attention makes 1M-token context economically usable, not just technically possible.
- Performance: V4-Pro pulls open source to parity with closed-source flagships on coding, under an MIT license.
- Cost: structural pricing plus Ascend integration shifts the cost curve of LLM infrastructure.

1. Context: where V4 sits in DeepSeek's 2026 roadmap

On 2026-04-24, DeepSeek released the V4 series in preview, comprising DeepSeek-V4-Pro and DeepSeek-V4-Flash. The weights are on Hugging Face under the MIT License — fully commercial-use friendly. This is DeepSeek's first major architecture jump since V3 shook the industry in early 2025: V3 → V3.1 → V3.2 were iterations within the same family, while V4 introduces a foundational change — the DSA sparse attention mechanism.

The timing is deliberate. The Chinese open-source LLM space has been crowded over the past 30 days — Alibaba's Qwen 3.5, Z.ai's GLM-5.1, and Moonshot's new Kimi all shipped recently. By leading V4 with hard numbers like LiveCodeBench 93.5 and Codeforces 3206, DeepSeek's intent is clear: retake the "best open-source coding model" badge from the Qwen-Coder line, while pushing price an order of magnitude below comparable closed-source offerings.

Equally important: Huawei announced same-day full inference support for the V4 series, and is committing to mass-produced Ascend 950 chips that will further drive V4-Pro pricing down. In other words, V4 is not just a model upgrade; it's the first industrial-scale demonstration of a Chinese AI stack — open weights, domestic accelerators, low-cost API — actually closing the loop.

2. Specs at a glance

| Spec | Value | Notes |
| --- | --- | --- |
| V4-Pro total / active params | 1.6T / 49B | MoE, 865 GB weights |
| V4-Flash total / active params | 284B / 13B | 160 GB weights, single-node serveable |
| Context window | 1,000,000 tokens | Both variants; enabled by DSA + token-wise compression |
| Pre-training tokens | 33T / 32T (Pro / Flash) | Comparable scale |
| License | MIT | Commercial use, fine-tuning, redistribution |
| Training compute | ~16,000 Hopper GPUs | Total cost ≈ $5.6M, ~2× efficiency over V3 |

3. The architectural innovation: what is DSA Sparse Attention?

In the V4 technical report, DeepSeek positions DSA (DeepSeek Sparse Attention) as the headline architectural change. It addresses one specific problem: at 1M-token context, traditional full-attention KV cache and per-step FLOPs both scale roughly quadratically — making 1M context "technically possible, economically unusable".

DSA combines two mechanisms:
  1. Sparse attention computation: each token attends only to a dynamically selected "salient subset" of other tokens, not the full context. The subset is chosen by a learned, content-driven mechanism, rather than by a fixed window (as in local attention) or a routing classifier (as in MoE).
  2. Token-wise compression: at the KV-cache layer, historical tokens are compressed adaptively based on content, so older tokens occupy less memory while preserving key semantics.
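The content-driven selection in mechanism 1 can be illustrated with a toy, single-query sketch. This is NumPy pseudocode for the general idea of learned top-k attention; the scoring rule, subset size, and shapes are illustrative assumptions, not DeepSeek's actual DSA:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """Toy single-query sparse attention: the query attends only to its
    k highest-scoring keys (a content-driven subset), not the full context."""
    scores = K @ q / np.sqrt(q.shape[-1])      # similarity to every key, shape (T,)
    idx = np.argpartition(scores, -k)[-k:]     # indices of the k most salient tokens
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                               # softmax over the subset only
    return w @ V[idx]                          # weighted value readout, shape (d,)

rng = np.random.default_rng(0)
T, d = 1024, 64
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
q = rng.normal(size=d)
out = topk_sparse_attention(q, K, V, k=32)
print(out.shape)  # (64,)
```

With k fixed, per-query cost of the softmax/readout stays O(k·d) regardless of how long the context grows; in DSA the selection itself is learned end-to-end during training rather than hand-coded like this argpartition.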

Combined effect: V4-Pro uses 27% of V3.2's per-token FLOPs and 10% of its KV cache; V4-Flash is even more aggressive at 10% FLOPs and 7% memory. On the same hardware, V4 can serve roughly an order of magnitude more concurrent requests.
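To see why a ~10× KV-cache cut matters at 1M tokens, here is a back-of-envelope sizing sketch. The layer/head/dimension numbers below are hypothetical, since DeepSeek has not published V4's exact attention configuration:

```python
# Hypothetical attention config, for illustration only.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_val = 2  # fp16/bf16

def kv_cache_gb(tokens, keep_fraction=1.0):
    """GB needed for K and V across all layers, scaled by the fraction
    of cache retained after token-wise compression."""
    full = 2 * layers * kv_heads * head_dim * bytes_per_val * tokens
    return full * keep_fraction / 1e9

full = kv_cache_gb(1_000_000)          # dense attention at 1M context
pro = kv_cache_gb(1_000_000, 0.10)     # V4-Pro: ~10% of V3.2's cache
flash = kv_cache_gb(1_000_000, 0.07)   # V4-Flash: ~7%
print(f"dense: {full:.0f} GB, Pro: {pro:.0f} GB, Flash: {flash:.1f} GB")
```

Under these assumed dimensions, a single 1M-token conversation would eat hundreds of GB of cache with dense attention, but tens of GB under DSA, which is where the order-of-magnitude concurrency gain comes from.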

Conceptually DSA shares ground with Mistral's Sliding Window Attention, the long-context moves Anthropic and others have made for Claude, and the state-space approach in Mamba — all trade full-connectivity attention for long-context practicality. What's distinctive about DeepSeek's version is how thoroughly the sparse assumption is baked into the entire pipeline: the model is trained, not just inference-patched, under the sparse regime. That's why V4 ships 1M context as a default capability rather than a special configuration.

4. Benchmarks: V4-Pro vs GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1-Pro

Below is V4-Pro (in its highest-reasoning "Pro-Max" mode) compared head-to-head with the three closed-source frontier models. Numbers are from DeepSeek's official technical report, cross-checked against published third-party results; gaps are scores that were not reported at release.

| Benchmark | DeepSeek V4-Pro | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1-Pro |
| --- | --- | --- | --- | --- |
| LiveCodeBench | 93.5 | 88.8 | 91.7 | — |
| Codeforces Rating | 3,206 | 3,168 | — | — |
| SWE-bench Verified | 80.6 | 80.8 | 80.6 | — |
| MMLU-Pro | 87.5 | 89.1 | 87.5 | 91.0 |
| GPQA Diamond | 90.1 | 91.3 | 93.0 | 94.3 |
| Terminal-Bench 2.0 | 67.9 | 65.4 | 75.1 | 68.5 |
| HLE (Humanity's Last Exam) | 37.7 | — | — | — |

How to read this

The conclusion is clean: V4-Pro is the open-source frontier, but not yet the absolute frontier. If your workload is centered on coding, algorithms, or long-document handling, V4-Pro is performance-equivalent and an order of magnitude cheaper. If you need top-tier scientific reasoning, closed-source flagships still have the edge — but the gap is narrowing.

5. Pricing: a structural cost shift

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Context | License |
| --- | --- | --- | --- | --- |
| DeepSeek V4-Pro | 1.74 | 3.48 | 1M | MIT (open) |
| DeepSeek V4-Flash | 0.14 | 0.28 | 1M | MIT (open) |
| OpenAI GPT-5.4 | ~10 | ~30 | 200K | Closed |
| Anthropic Claude Opus 4.6 | ~7.5 | ~25 | 500K | Closed |
| Moonshot Kimi (latest) | ~1.5 | ~4 | 200K | Closed API |
| Alibaba Qwen-3.5-Max | ~0.8 | ~2.4 | 1M | Mixed (mid-size open) |

V4-Pro's output price is roughly 9× cheaper than OpenAI and 7× cheaper than Claude. V4-Flash output at $0.28/M is approximately 1/100 of OpenAI's. Even within the Chinese open-source camp, V4-Flash undercuts Qwen-3.5-Max by 8–9×.
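A quick way to feel these ratios is to price a hypothetical workload. The token volumes below are invented for illustration; the per-million prices come from the table above:

```python
# $/1M tokens (input, output), from the pricing table above.
prices = {
    "DeepSeek V4-Pro":   (1.74, 3.48),
    "DeepSeek V4-Flash": (0.14, 0.28),
    "GPT-5.4":           (10.0, 30.0),
    "Claude Opus 4.6":   (7.5, 25.0),
}
# Hypothetical monthly volume: 2B input + 500M output tokens (in millions).
in_m, out_m = 2_000, 500

monthly = {name: in_m * p_in + out_m * p_out
           for name, (p_in, p_out) in prices.items()}
for name, cost in monthly.items():
    print(f"{name}: ${cost:,.0f}/month")
```

At this volume the same workload runs to roughly $5.2K on V4-Pro and a few hundred dollars on V4-Flash, versus tens of thousands on the closed flagships.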

This isn't a promotional price. Combined with the architecture changes above, V4's pricing is structural — MoE activation efficiency + DSA-driven KV reduction + Ascend 950 inference optimization stack into a real per-token cost reduction. Even if OpenAI or Anthropic respond with price cuts, matching this floor is hard.

6. Selection guide: when to use V4 — and when not to

V4-Pro is the natural choice when…

Large-codebase refactors, full-repo reviews, multi-file edits: 1M context plus top-tier LiveCodeBench/SWE-bench performance — V4-Pro is the most cost-effective "global code understanding" model available.

Long-document / PDF / contract analysis: 1M context fits an entire book or dozens of contracts; tokens cost about 1/7 of Claude.

Cost-sensitive Agent orchestration at scale: if your Agent system burns thousands of dollars in tokens daily, V4-Pro can shave roughly 80% off your LLM bill.

V4-Flash is the natural choice when…

Large-scale online Q&A / customer support: $0.28/M output plus 1M context fits an entire knowledge base and chat history.

Batch pipelines (data cleaning, summarization, translation): at Flash's price you can run full-scale jobs at high concurrency without flinching.

Local deployment (160 GB weights): a single 8×A100 or 8×H100 node serves it; once Unsloth's quantized builds drop, even 4×H100 or smaller will work.

Where V4 is NOT the right pick:

Frontier scientific reasoning: on GPQA Diamond and MMLU-Pro, Gemini 3.1-Pro and GPT-5.4 still hold a clear lead (see the benchmark table in Section 4).

Multimodal workloads: V4 is a text-only release; image or audio inputs need a different model until a multimodal V4.5/V5 arrives (see Section 8).

7. Three ways to access V4

A. DeepSeek's official API (most direct)

It's OpenAI-compatible — just swap the base_url:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_KEY",
    base_url="https://api.deepseek.com/v1",
)
resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # or "deepseek-v4-flash"
    messages=[{"role": "user", "content": "Explain DSA sparse attention."}],
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```

B. OpenRouter (multi-provider routing)

If you already use OpenRouter for multi-model fallback, the model IDs are deepseek/deepseek-v4-pro and deepseek/deepseek-v4-flash. There's a 5–10% routing markup, but you get one API key with seamless fallback to Claude or GPT-5 when needed.

C. Self-hosting via Hugging Face + vLLM

For teams already self-hosting V3/V3.2 or with strict data-residency requirements: V4-Flash (160 GB) runs on a single 8×H100 box; V4-Pro (865 GB) needs a 4–8 node cluster with high-bandwidth interconnect. Unsloth-quantized GGUF builds are imminent and will further lower the hardware bar.
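For a single-node Flash deployment, the serving command could look something like this with vLLM's OpenAI-compatible server. The Hugging Face repo name here is a guess; check the actual model card before using it:

```shell
# Hypothetical repo ID -- verify against the official Hugging Face model card.
# Shards V4-Flash across 8 GPUs on one node and exposes an OpenAI-compatible API.
vllm serve deepseek-ai/DeepSeek-V4-Flash \
    --tensor-parallel-size 8 \
    --max-model-len 1000000
```

The `--max-model-len` cap can be lowered if you don't need the full 1M context, which proportionally shrinks the KV-cache memory vLLM reserves per request.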

8. How V4 fits into this week's bigger trends

In our weekly highlights this week we identified three trends: multimodal generation entering industrial deployment, latent reasoning challenging explicit CoT, and Agent training becoming systematic. DeepSeek V4 strongly validates the latter two.

The remaining trend, multimodal generation, doesn't show up in V4: this is a text-only release. A multimodal version is expected in V4.5 or V5. This matches DeepSeek's standing strategy: each generation drives one breakthrough through to industrial scale rather than shipping a full multimodal stack at once.

9. The under-appreciated signal: open source has crossed the cost line

For the past three years, the open-source LLM story has been "performance gradually catching closed-source." V4 changes the script — it delivers ~80% of frontier performance plus a step-change in cost.

Take LiveCodeBench (probably the most engineering-relevant code benchmark) as a quick napkin calculation: V4-Pro (93.5 score, $3.48/M output) is ~27 score-points per dollar; Claude Opus 4.6 (88.8 score, $25/M output) is ~3.6 score-points per dollar. At equal budgets, V4-Pro produces about 7.5× the high-quality code-token throughput of Claude.
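The napkin math above, spelled out with the scores and prices from this article (score-per-dollar is of course a crude metric, since benchmark points are not linearly comparable):

```python
# LiveCodeBench score and output price ($/1M tokens) from Sections 4 and 5.
models = {
    "DeepSeek V4-Pro": {"livecodebench": 93.5, "out_price": 3.48},
    "Claude Opus 4.6": {"livecodebench": 88.8, "out_price": 25.0},
}

# Score-points per dollar of output tokens.
ratios = {name: m["livecodebench"] / m["out_price"] for name, m in models.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1f} score-points per $")

# At equal budgets, relative high-quality code-token throughput.
advantage = ratios["DeepSeek V4-Pro"] / ratios["Claude Opus 4.6"]
print(f"budget-equivalent throughput advantage: ~{advantage:.2f}x")
```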

When an open-source model holds parity in the main battleground (coding) at 7–9× lower cost, the closed-source moat reduces to two things: (1) knowledge depth (where Gemini still leads on GPQA / MMLU); and (2) tooling and ecosystem maturity (function-calling SDKs, structured-output validation, enterprise SLAs). The first will erode with time; the second is exactly where Anthropic and OpenAI need to spend their next year.

10. What to watch over the next 30 days

V4 is a release that gets three things right at once: architecture-wise, DSA makes 1M context truly usable; performance-wise, it pulls open source to parity with Claude Opus 4.6 / Gemini 3.1-Pro; commercially, structural pricing plus Ascend integration shifts the cost curve of LLM infrastructure.

Four follow-ups worth tracking:

  1. Independent benchmark replications: nearly all of today's numbers are official; leaderboards such as LMSys, Aider, and Vellum will confirm or contradict them.
  2. Unsloth quantization and local-deployment ecosystem: if V4-Flash quantized cleanly to a single H100, expect a wave of "self-hosted flagship" use cases in the open-source community.
  3. OpenAI / Anthropic price responses: how aggressively GPT-5.4 and Claude Opus respond will determine how long V4's price advantage holds.
  4. Ascend 950 mass-production cadence: DeepSeek has signaled further price cuts contingent on Ascend 950 ramp. This is the milestone for whether a Chinese AI stack can truly de-risk from Nvidia.

Want to track new AI models and trending papers daily? Paper Collector pulls Hugging Face papers and open-source projects every day with summaries.


Sources: DeepSeek's official V4 technical report and release notes, Hugging Face model cards, Simon Willison's hands-on notes, Bloomberg, CNBC, Fortune, MIT Technology Review, Macaron benchmark compilation. Written 2026-04-25; some independent benchmarks are still being replicated. Pricing and availability are per the official release. Trend judgments reflect the team's view; for reference only.