Cursor Achieves 1.8x Inference Speedup on NVIDIA B200 GPUs

Cursor says it arrived at the design by asking what the maximum achievable memory bandwidth for MoE decode on Blackwell actually is.


Cursor has introduced a new inference technique, “warp decode,” that restructures how Mixture-of-Experts (MoE) models execute during token generation, reporting a 1.84x throughput improvement on NVIDIA Blackwell GPUs.

The approach targets a specific inefficiency in autoregressive decoding: models generate one token at a time, so each step is dominated by streaming weights from memory rather than by computation, and traditional batching strategies lose much of their effectiveness.
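A rough roofline calculation illustrates why decode at small batch sizes is memory-bound. The hardware figures below are illustrative assumptions for a Blackwell-class accelerator, not numbers from Cursor's post:

```python
# Rough roofline arithmetic for MoE decode. All hardware numbers here
# are illustrative assumptions, not figures from Cursor's post.

def arithmetic_intensity(batch_size: int, bytes_per_weight: float = 1.0) -> float:
    """FLOPs performed per weight byte read during a decode step.

    A GEMV-like decode step reads every active weight once, and each
    weight contributes one multiply-add (2 FLOPs) per token in the batch.
    """
    return 2 * batch_size / bytes_per_weight

# Hypothetical accelerator: 4.5 PFLOP/s of dense low-precision compute
# and 8 TB/s of memory bandwidth. It needs ~562 FLOPs per byte read to
# become compute-bound; decode at batch size 32 delivers only 64.
compute_flops = 4.5e15
bandwidth_bytes = 8e12
balance = compute_flops / bandwidth_bytes

print(arithmetic_intensity(32))  # 64.0, far below the balance point
print(balance)                   # 562.5
```

At that gap, the GPU's arithmetic units sit idle waiting on memory, which is why sustained bandwidth, not FLOPs, becomes the figure of merit for decode.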

As the company explained in a blog post, “We arrived by thinking about what the maximum achievable memory bandwidth for MoE decode on Blackwell actually is.”

MoE models route each token through a subset of specialised neural networks, typically selecting a small number of experts at each layer. Conventional implementations organise computation around these experts, grouping tokens, executing matrix operations, and recombining results. While effective at scale, this structure becomes inefficient during decoding.
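To make that structure concrete, here is a minimal NumPy sketch of expert-centric grouped execution. All names, shapes, and routing details are hypothetical, not taken from Cursor's implementation; the gather and scatter steps are exactly the kind of layout management the pipeline spends its stages on:

```python
import numpy as np

def expert_centric_moe(x, router_logits, experts, top_k=2):
    """Expert-centric grouped execution, heavily simplified: gather the
    tokens routed to each expert, run one dense matmul per expert, then
    scatter the gate-weighted results back to token order. Real kernels
    add permutation, padding, and recombination stages around this."""
    n_tokens, d = x.shape
    d_out = experts[0].shape[1]

    # Route each token to its top-k experts with softmax gate weights.
    topk = np.argsort(-router_logits, axis=1)[:, :top_k]
    sel = np.take_along_axis(router_logits, topk, axis=1)
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)

    out = np.zeros((n_tokens, d_out), dtype=np.float32)
    for e in range(len(experts)):
        # Gather: find every (token, slot) pair routed to expert e.
        tok, slot = np.nonzero(topk == e)
        if tok.size == 0:
            continue
        # Grouped matmul for this expert, then scatter weighted outputs.
        y = x[tok] @ experts[e]
        np.add.at(out, tok, gates[tok, slot, None] * y)
    return out
```

Only the matmul line does useful arithmetic; the routing, gather, and scatter steps exist purely to arrange data for the per-expert view.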

“Five of the eight stages in the traditional path exist purely to manage data layout for the expert-centric view and perform no actual computation,” Cursor noted, highlighting how much of the pipeline is spent on overhead rather than useful work.

“Warp decode” replaces this expert-centric structure with an output-centric execution model. Instead of assigning GPU work units to experts, each warp (a group of 32 parallel processing lanes) computes a single output value.

“Each warp is assigned exactly one output value to compute,” the company posted. These warps independently stream weights, aggregate contributions across routed experts in registers, and write results directly, eliminating intermediate buffers and cross-warp coordination.

This allows the pipeline to run “without any staging, handoffs, cross-warp sync points, or intermediate buffers,” compressing the MoE layer into two fused kernels and significantly reducing memory traffic.
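As a loose mental model (a sketch under assumed shapes, not Cursor's kernel), the loop below treats each iteration as one warp that owns exactly one output value, streams the corresponding weight column from each routed expert, accumulates in a local variable standing in for registers, and performs a single direct write:

```python
import numpy as np

def output_centric_moe_token(x_tok, topk_experts, gates, experts):
    """Output-centric sketch for one token: each loop iteration plays
    the role of one warp computing exactly one output value, with no
    staging buffers or cross-"warp" coordination. A hypothetical
    simplification, not Cursor's actual kernel."""
    d_out = experts[0].shape[1]
    out = np.empty(d_out, dtype=np.float32)
    for j in range(d_out):                  # one "warp" per output value
        acc = np.float32(0.0)               # register accumulator
        for g, e in zip(gates, topk_experts):
            # Stream expert e's weight column for output j and reduce it
            # directly into the accumulator; nothing is materialised
            # between experts.
            acc += np.float32(g) * (x_tok @ experts[e][:, j])
        out[j] = acc                        # single direct write
    return out
```

Because each output value is computed independently, many such units can be in flight at once with no synchronisation between them.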

In testing on NVIDIA B200 hardware, the pipeline sustained 3.95 TB/s of memory bandwidth at a batch size of 32, or about 58% of the hardware’s theoretical peak. Cursor attributes the remaining gap to structural limits in MoE workloads rather than to implementation inefficiencies.

“The remaining gap likely reflects the memory latency cost of the random access patterns that expert routing creates,” Cursor posited. Because each warp operates independently, the GPU scheduler can hide memory latency by switching execution between thousands of concurrent warps.

The technique also improves numerical fidelity, not by altering the model itself but by changing how computations are performed. Traditional pipelines introduce rounding errors due to repeated precision conversions, whereas warp decode keeps activations at higher precision and accumulates results in FP32 registers.
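A small experiment (illustrative only, not Cursor's code) shows the effect: summing many half-precision products in a half-precision running total rounds after every addition and drifts, while the same products summed in an FP32 accumulator stay close to the exact result:

```python
import numpy as np

# Illustration only (not Cursor's kernel): the same FP16 products are
# summed two ways, so any difference comes purely from the accumulator.
rng = np.random.default_rng(0)
a = rng.standard_normal(100_000).astype(np.float16)
b = rng.standard_normal(100_000).astype(np.float16)
products = a * b                      # rounded to FP16 in both paths

acc16 = np.float16(0.0)
for p in products:                    # FP16 running sum: rounds each step
    acc16 = np.float16(acc16 + p)

acc32 = products.astype(np.float32).sum()   # FP32 accumulator
exact = products.astype(np.float64).sum()   # high-precision reference

err16 = abs(float(acc16) - exact)
err32 = abs(float(acc32) - exact)
print(err16, err32)   # the FP32 accumulator lands far closer to exact
```

The same principle applies in-kernel: keeping per-output partial sums in FP32 registers avoids the repeated low-precision rounding that staged pipelines incur.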

Cursor characterised this as a rare combination of benefits, stating, “Kernels that improve both performance and accuracy are rare, and warp decode is one of them.”

The code editor said the approach is not a universal replacement for existing MoE execution strategies. Expert-centric batching remains more efficient for prefill and large-batch inference, where overhead can be amortised.

Warp decode is instead designed for decode-heavy workloads with limited shared computation, improving throughput per GPU and accelerating internal model-iteration cycles.


Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
