Cursor Achieves 1.8x Inference Speedup on NVIDIA B200 GPUs

Cursor says it arrived at the design by asking what the maximum achievable memory bandwidth for MoE decode on Blackwell actually is.


Cursor has introduced a new inference technique, “warp decode,” that restructures how Mixture-of-Experts (MoE) models execute during token generation, reporting a 1.84x throughput improvement on NVIDIA Blackwell GPUs.

The approach targets a specific inefficiency in autoregressive decoding: models generate one token at a time, so each step is dominated by streaming weights from memory rather than by computation, and traditional batching strategies lose much of their effectiveness.
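A rough roofline calculation illustrates why decode at small batch sizes is memory-bound. The hardware figures below are illustrative assumptions for a Blackwell-class accelerator, not numbers from Cursor's post:

```python
# Rough roofline arithmetic for MoE decode. All hardware numbers here
# are illustrative assumptions, not figures from Cursor's post.

def arithmetic_intensity(batch_size: int, bytes_per_weight: float = 1.0) -> float:
    """FLOPs performed per weight byte read during a decode step.

    A GEMV-like decode step reads every active weight once, and each
    weight contributes one multiply-add (2 FLOPs) per token in the batch.
    """
    return 2 * batch_size / bytes_per_weight

# Hypothetical accelerator: 4.5 PFLOP/s of dense low-precision compute
# and 8 TB/s of memory bandwidth. It needs ~562 FLOPs per byte read to
# become compute-bound; decode at batch size 32 delivers only 64.
compute_flops = 4.5e15
bandwidth_bytes = 8e12
balance = compute_flops / bandwidth_bytes

print(arithmetic_intensity(32))  # 64.0, far below the balance point
print(balance)                   # 562.5
```

At that gap, the GPU's arithmetic units sit idle waiting on memory, which is why sustained bandwidth, not FLOPs, becomes the figure of merit for decode.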

As the company explained in a blog post, “We arrived by thinking about what the maximum achievable memory bandwidth for MoE decode on Blackwell actually is.”

MoE models route each token through a subset of specialised neural networks, typically selecting a small number of experts at each layer. Conventional implementations organise computation around these experts, grouping tokens, executing matrix operations, and recombining results. While effective at scale, this structure becomes inefficient during decoding.
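To make that structure concrete, here is a minimal NumPy sketch of expert-centric grouped execution. All names, shapes, and routing details are hypothetical, not taken from Cursor's implementation; the gather and scatter steps are exactly the kind of layout management the pipeline spends its stages on:

```python
import numpy as np

def expert_centric_moe(x, router_logits, experts, top_k=2):
    """Expert-centric grouped execution, heavily simplified: gather the
    tokens routed to each expert, run one dense matmul per expert, then
    scatter the gate-weighted results back to token order. Real kernels
    add permutation, padding, and recombination stages around this."""
    n_tokens, d = x.shape
    d_out = experts[0].shape[1]

    # Route each token to its top-k experts with softmax gate weights.
    topk = np.argsort(-router_logits, axis=1)[:, :top_k]
    sel = np.take_along_axis(router_logits, topk, axis=1)
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)

    out = np.zeros((n_tokens, d_out), dtype=np.float32)
    for e in range(len(experts)):
        # Gather: find every (token, slot) pair routed to expert e.
        tok, slot = np.nonzero(topk == e)
        if tok.size == 0:
            continue
        # Grouped matmul for this expert, then scatter weighted outputs.
        y = x[tok] @ experts[e]
        np.add.at(out, tok, gates[tok, slot, None] * y)
    return out
```

Only the matmul line does useful arithmetic; the routing, gather, and scatter steps exist purely to arrange data for the per-expert view.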

“Five of the eight stages in the traditional path exist purely to manage data layout for the expert-centric view and perform no actual computation,” Cursor noted, highlighting how much of the pipeline is spent on overhead rather than useful work.

“Warp decode” replaces this expert-centric structure with an output-centric execution model. Instead of assigning GPU work units to experts, each warp (a group of 32 parallel processing lanes) computes a single output value.

“Each warp is assigned exactly one output value to compute,” the company posted. These warps independently stream weights, aggregate contributions across routed experts in registers, and write results directly, eliminating intermediate buffers and cross-warp coordination.

This allows the pipeline to run “without any staging, handoffs, cross-warp sync points, or intermediate buffers,” compressing the MoE layer into two fused kernels and significantly reducing memory traffic.
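As a loose mental model (a sketch under assumed shapes, not Cursor's kernel), the loop below treats each iteration as one warp that owns exactly one output value, streams the corresponding weight column from each routed expert, accumulates in a local variable standing in for registers, and performs a single direct write:

```python
import numpy as np

def output_centric_moe_token(x_tok, topk_experts, gates, experts):
    """Output-centric sketch for one token: each loop iteration plays
    the role of one warp computing exactly one output value, with no
    staging buffers or cross-"warp" coordination. A hypothetical
    simplification, not Cursor's actual kernel."""
    d_out = experts[0].shape[1]
    out = np.empty(d_out, dtype=np.float32)
    for j in range(d_out):                  # one "warp" per output value
        acc = np.float32(0.0)               # register accumulator
        for g, e in zip(gates, topk_experts):
            # Stream expert e's weight column for output j and reduce it
            # directly into the accumulator; nothing is materialised
            # between experts.
            acc += np.float32(g) * (x_tok @ experts[e][:, j])
        out[j] = acc                        # single direct write
    return out
```

Because each output value is computed independently, many such units can be in flight at once with no synchronisation between them.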

In testing on NVIDIA B200 hardware, the pipeline sustained 3.95 TB/s of memory bandwidth at a batch size of 32, or about 58% of the hardware’s theoretical peak. Cursor attributes the remaining gap to structural limits in MoE workloads rather than to implementation inefficiencies.

“The remaining gap likely reflects the memory latency cost of the random access patterns that expert routing creates,” Cursor posited. Because each warp operates independently, the GPU scheduler can hide memory latency by switching execution between thousands of concurrent warps.

The technique also improves numerical fidelity, not by altering the model itself but by changing how computations are performed. Traditional pipelines introduce rounding errors due to repeated precision conversions, whereas warp decode keeps activations at higher precision and accumulates results in FP32 registers.
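A small experiment (illustrative only, not Cursor's code) shows the effect: summing many half-precision products in a half-precision running total rounds after every addition and drifts, while the same products summed in an FP32 accumulator stay close to the exact result:

```python
import numpy as np

# Illustration only (not Cursor's kernel): the same FP16 products are
# summed two ways, so any difference comes purely from the accumulator.
rng = np.random.default_rng(0)
a = rng.standard_normal(100_000).astype(np.float16)
b = rng.standard_normal(100_000).astype(np.float16)
products = a * b                      # rounded to FP16 in both paths

acc16 = np.float16(0.0)
for p in products:                    # FP16 running sum: rounds each step
    acc16 = np.float16(acc16 + p)

acc32 = products.astype(np.float32).sum()   # FP32 accumulator
exact = products.astype(np.float64).sum()   # high-precision reference

err16 = abs(float(acc16) - exact)
err32 = abs(float(acc32) - exact)
print(err16, err32)   # the FP32 accumulator lands far closer to exact
```

The same principle applies in-kernel: keeping per-output partial sums in FP32 registers avoids the repeated low-precision rounding that staged pipelines incur.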

Cursor characterised this as a rare combination of benefits, stating, “Kernels that improve both performance and accuracy are rare, and warp decode is one of them.”

The code editor said the approach is not a universal replacement for existing MoE execution strategies. Expert-centric batching remains more efficient for prefill and large-batch inference, where overhead can be amortised.

Warp decode is instead designed for decode-heavy workloads with limited shared computation, improving throughput per GPU and accelerating internal model-iteration cycles.


Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.
