DeepSeek has released new research showing that a promising but fragile neural network design can be stabilised at scale, delivering measurable performance gains in large language models without significantly compromising efficiency.
The paper, titled Manifold-Constrained Hyper-Connections, builds on an emerging architectural approach known as ‘Hyper-Connections’, which allows multiple residual pathways inside a model to mix dynamically rather than follow a single fixed route.
The idea is to give models more internal flexibility, enabling stronger reasoning and more effective use of parameters as they scale.
Earlier versions of this design, however, proved difficult to train at large sizes.
Unconstrained mixing could unintentionally amplify or suppress signals as they passed through layers, leading to what the authors describe as “severe numerical instability” in deeper models: in practice, unstable gradients and sudden training failures at scale.
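To make the mechanism concrete, the following PyTorch sketch shows the general shape of an unconstrained hyper-connection layer in this family of designs. Everything in it, from the class name to the choice of which stream receives the block’s output, is an illustrative assumption rather than DeepSeek’s published code.

```python
# Illustrative sketch only: names and design choices are assumptions,
# not DeepSeek's implementation.
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Carries n parallel residual streams and mixes them with a learnable matrix."""
    def __init__(self, n_streams: int, d_model: int):
        super().__init__()
        # Unconstrained learnable mixing matrix, initialised at the identity.
        self.mix = nn.Parameter(torch.eye(n_streams))
        self.block = nn.Linear(d_model, d_model)  # stand-in for an attention/MLP block

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, d_model)
        mixed = torch.einsum('ij,jbd->ibd', self.mix, streams)
        # One stream passes through the block and keeps a residual connection.
        updated = mixed[0] + self.block(mixed[0])
        return torch.cat([updated.unsqueeze(0), mixed[1:]], dim=0)
```

Because nothing bounds the mixing matrix, its gain compounds across stacked layers: if its largest singular value drifts above 1 during training, activations grow roughly geometrically with depth, which is consistent with the instability the paper describes.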
DeepSeek’s contribution is a constrained version of the architecture that limits residual mixing to redistributing information rather than amplifying it, ensuring what the paper calls “bounded signal propagation across depth.” The constraint restores training stability while preserving the benefits of richer internal routing.
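The paper’s exact manifold constraint is not reproduced here, but one simple way to realise “redistribute rather than amplify” is to keep each row of the mixing matrix on the probability simplex, so every output stream is a convex combination of the input streams. The softmax row-normalisation below is therefore a stand-in for the paper’s projection, not its actual formulation.

```python
# Hedged sketch: softmax row-normalisation is an illustrative stand-in for
# the paper's manifold constraint, not DeepSeek's actual formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedHyperConnection(nn.Module):
    def __init__(self, n_streams: int, d_model: int):
        super().__init__()
        # Learn logits; the forward pass projects them onto row-stochastic mixes.
        self.mix_logits = nn.Parameter(torch.zeros(n_streams, n_streams))
        self.block = nn.Linear(d_model, d_model)  # stand-in for an attention/MLP block

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # Rows are non-negative and sum to 1, so mixing can only redistribute
        # signal between streams; it cannot increase the largest magnitude.
        mix = F.softmax(self.mix_logits, dim=-1)
        mixed = torch.einsum('ij,jbd->ibd', mix, streams)
        updated = mixed[0] + self.block(mixed[0])
        return torch.cat([updated.unsqueeze(0), mixed[1:]], dim=0)

# Shapes are unchanged; only the mixing is constrained:
streams = torch.randn(4, 2, 64)            # 4 streams, batch of 2, width 64
out = ConstrainedHyperConnection(4, 64)(streams)
assert out.shape == streams.shape
```

Since each mixed stream is a convex combination of the inputs, the mixing step alone can never enlarge the largest activation, which is one concrete reading of the paper’s “bounded signal propagation across depth.”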
Models using the approach trained reliably up to 27 billion parameters, a scale at which unconstrained Hyper-Connections failed.
On BIG-Bench Hard, a benchmark focused on complex, multi-step reasoning, accuracy rose from 43.8% to 51.0%.
Performance also improved on DROP, a benchmark testing numerical and logical reasoning over long passages, and on GSM8K, a standard test of mathematical reasoning.
Crucially, these gains came with only a roughly 6–7% increase in training overhead, suggesting the approach could be viable for production-scale models.
The company has published a technical report detailing the methodology and findings of the research.
DeepSeek’s work points to a broader implication: meaningful performance improvements may increasingly come from architectural refinements, not just larger models or more data.
The work also fits into a broader pattern in DeepSeek’s research strategy.
The lab was previously credited with developing Group Relative Policy Optimisation (GRPO), a reinforcement learning method used to train its reasoning-focused models, including DeepSeek-R1.
That model drew widespread attention for delivering strong reasoning performance with significantly lower training compute, briefly unsettling assumptions across the AI industry and even rippling into public markets.
Last month, DeepSeek launched two new reasoning-first AI models, DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, expanding its suite of systems for agents, tool use and complex inference.
The models introduce an expansion of DeepSeek’s agent-training approach, supported by a new synthetic dataset spanning more than 1,800 environments and 85,000 complex instructions.
The company stated that V3.2 is its first model to integrate thinking directly into tool use, allowing structured reasoning to operate both within and alongside external tools.
In November, DeepSeek released DeepSeekMath-V2, becoming one of only three AI labs—alongside OpenAI and Google DeepMind—to achieve a gold-medal-level score on the International Mathematical Olympiad (IMO) 2025 benchmark.