DeepSeek V4 Gains Up to 85% Speed With New Inference Technique

DeepSeek has introduced a new speculative decoding framework designed to accelerate large language model inference by combining semi-autoregressive token generation with confidence-based verification scheduling.

According to the company, the system increases per-user text-generation speed by 60–85% compared with its previous production baseline while maintaining the same aggregate throughput.

The researchers said the method also “enables performance tiers that were previously unattainable” under strict interactivity constraints.

Speculative decoding speeds up inference by using a lightweight draft model that proposes multiple tokens before a larger target model verifies them.
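The draft-then-verify loop can be illustrated with a toy sketch. This is a minimal greedy variant under stated assumptions, not DeepSeek's implementation: the names `speculative_step`, `draft_next` and `target_next` are hypothetical, the "models" are simple functions over integer tokens, and verification is written as a loop even though a real system would score all draft tokens in one batched pass of the target model.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next token


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens, one after another.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The target model checks each proposal against its own greedy choice.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models (hypothetical): the target counts upward modulo 10;
# the draft agrees except that it guesses wrong right after a 4.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft_next, target_next, k=4))  # [2, 3, 4, 5]
```

In this toy run, three draft tokens are accepted and the fourth is replaced by the target's own choice, so one verification round yields four tokens that match what the target would have produced on its own.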

Existing parallel draft models can generate long token blocks quickly, but often suffer declining acceptance rates because each token is predicted independently. To address this, DeepSeek said the new framework “unifies high-throughput parallel generation with adaptive, load-aware verification.”
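Why acceptance rates matter can be seen with a simplified model, offered here as an illustration rather than as DeepSeek's analysis. Assume verification stops at the first rejected token and that position i is accepted with probability p_i given that all earlier positions were accepted. The expected number of accepted draft tokens is then the sum of the running products of those probabilities. The function name and the probability values below are hypothetical.

```python
from itertools import accumulate
from operator import mul
from typing import Sequence


def expected_accepted_length(accept_probs: Sequence[float]) -> float:
    """Expected accepted draft tokens when verification stops at the first rejection.

    accept_probs[i] is the assumed probability that position i is accepted,
    given that every earlier position was accepted.
    """
    # accumulate(..., mul) yields p1, p1*p2, p1*p2*p3, ...: the chance that
    # the draft survives through each position. Summing gives the expectation.
    return sum(accumulate(accept_probs, mul))


steady = [0.9, 0.9, 0.9, 0.9]      # acceptance holds up along the block
declining = [0.9, 0.8, 0.7, 0.6]   # acceptance decays at later positions

print(f"{expected_accepted_length(steady):.4f}")     # 3.0951
print(f"{expected_accepted_length(declining):.4f}")  # 2.4264
```

Under these made-up numbers, a four-token block with decaying acceptance yields fewer accepted tokens per round than one with steady acceptance, even though both start at the same first-position rate.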

The framework achieves this by pairing a parallel backbone with a lightweight sequential module, and by dynamically adjusting how many draft tokens are sent for verification based on predicted acceptance probabilities and system load.
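The article does not describe the scheduling rule itself, so the following is only a sketch of one way confidence-based, load-aware verification could work. It assumes the drafter emits a predicted acceptance probability per token and that system load is summarized as a number between 0 and 1. The function `choose_verify_length`, its thresholds, and the linear interpolation between them are all hypothetical choices for illustration.

```python
from typing import Sequence


def choose_verify_length(accept_probs: Sequence[float], load: float,
                         idle_threshold: float = 0.2,
                         busy_threshold: float = 0.8) -> int:
    """Pick how many leading draft tokens to send to the target for verification.

    accept_probs: predicted acceptance probability for each draft token (assumed input).
    load: 0.0 for an idle system, 1.0 for a saturated one (assumed metric).
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0 and 1")

    # The busier the system, the more confident we must be that a token
    # will survive verification before spending target compute on it.
    threshold = idle_threshold + load * (busy_threshold - idle_threshold)

    survival = 1.0  # chance that the draft is still accepted at this position
    count = 0
    for p in accept_probs:
        survival *= p
        if survival < threshold:
            break
        count += 1
    return count


probs = [0.95, 0.9, 0.8, 0.6, 0.4]
print(choose_verify_length(probs, load=0.0))  # 4
print(choose_verify_length(probs, load=0.5))  # 3
print(choose_verify_length(probs, load=1.0))  # 2
```

With the same draft block, this rule verifies more tokens when the system is idle and fewer under heavy load, which is the general trade-off the framework is described as managing.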

On offline benchmarks covering mathematical reasoning, code generation, and chat workloads, DeepSeek said the framework consistently outperformed both autoregressive and parallel speculative decoding methods.

Across Qwen3 4B, 8B and 14B models, it increased the average accepted draft length by 26.7% to 30.9% over Eagle3 and by 16.3% to 18.4% over DFlash, allowing more tokens to be accepted in each verification round.

DeepSeek also reported production deployment results from its DeepSeek-V4 serving system under live user traffic.

Compared with the earlier MTP-1 production baseline, the new approach increased per-user generation speeds by 60–85% on DeepSeek-V4-Flash and 57–78% on DeepSeek-V4-Pro at matched throughput levels.

The company said it “mitigates verification overhead to maintain robust throughput” under high-concurrency workloads, allowing the serving system to sustain performance at stricter latency targets.

DeepSeek is open-sourcing the framework, called DSpark, through DeepSpec, a training repository on GitHub that also supports Eagle3 and DFlash for speculative decoding research. DSpark-enabled checkpoints for DeepSeek-V4-Flash (preview) and DeepSeek-V4-Pro (preview) are available on Hugging Face, allowing developers to reproduce the framework and deploy the speculative decoding module with the preview models.

The release comes days after DeepSeek completed its first external fundraising round, raising more than 50 billion yuan ($7.4 billion) at a valuation exceeding $50 billion, according to multiple reports.

Founder Liang Wenfeng reportedly contributed about $3 billion to the round, with investors including Tencent, battery maker CATL and China’s National Artificial Intelligence Industry Investment Fund.

DeepSeek will use the new capital to fund AI infrastructure, product development, and a major hiring push, and plans to at least double the size of every department as it expands beyond its research-focused roots.
