Google has released a new software architecture for its Gemma 4 AI models, designed to speed up text generation by up to 300%.
The company launched Multi-Token Prediction (MTP) drafters to address the memory-bandwidth bottlenecks that typically slow processing.
The Gemma 4 open-weights model family, released last month, has recorded 60 million downloads in its first few weeks and gives developers openly licensed models for building applications.
Standard LLMs generate text one token at a time, which forces processors to repeatedly move billions of parameters from memory to compute units. Google’s update uses speculative decoding to decouple token generation from verification. The architecture pairs a heavy target model, such as the 31-billion-parameter Gemma 4, with a lightweight drafter.
This smaller drafter predicts multiple future tokens simultaneously using idle computational power. The larger target model then checks these suggestions in a single parallel pass.
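The draft-then-verify loop described above can be sketched in a few lines. The toy below uses deterministic stand-in functions (`target_next` and `draft_next` are hypothetical, not Google's implementation) to show the core pattern: the drafter proposes a block of tokens cheaply, the target checks the whole block in one pass, and the longest agreeing prefix is kept.

```python
def target_next(context):
    """Stand-in for the large target model: next token = sum of context mod 10."""
    return sum(context) % 10

def draft_next(context):
    """Stand-in for the lightweight drafter: usually agrees with the target,
    but is deliberately wrong when the context ends in 7, to force rejections."""
    if context and context[-1] == 7:
        return (sum(context) + 1) % 10  # deliberate disagreement
    return sum(context) % 10

def speculative_decode(context, n_tokens, k=4):
    """Generate n_tokens: the drafter proposes k tokens per round; the target
    verifies all k in a single pass and keeps the longest agreeing prefix."""
    out = list(context)
    generated = 0
    while generated < n_tokens:
        # 1. Drafter proposes k tokens autoregressively (the cheap step).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies the whole proposal in one "parallel" pass.
        accepted, ctx = 0, list(out)
        for t in proposal:
            if target_next(ctx) == t:
                accepted += 1
                ctx.append(t)
            else:
                break
        out.extend(proposal[:accepted])
        generated += accepted
        # 3. On a mismatch, fall back to the target's own token, which also
        #    guarantees progress each round.
        if accepted < k and generated < n_tokens:
            out.append(target_next(out))
            generated += 1
    return out[len(context):][:n_tokens]
```

Because the target model arbitrates every token, the output is identical to what target-only decoding would produce; the drafter only changes how many tokens each expensive target pass confirms.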
Google’s Product Management Director Olivier Lacombe and Developer Relations Engineer Maarten Grootendorst noted in a blog post that standard autoregressive generation devotes as much processing power to an obvious text continuation “as it does to solving a complex logic puzzle.”
The speed improvements apply across different hardware environments: on Apple Silicon chips, processing multiple requests simultaneously yields a 2.2x speed increase for the 26-billion-parameter mixture-of-experts model.
The update also optimises the Gemma 4 E2B and E4B edge models through clustering techniques, preserving battery life on mobile devices. Because the primary model handles final verification, developers maintain “identical frontier-class reasoning and accuracy, just delivered significantly faster.”
To achieve these results, Google engineered the draft models to share the target model’s cache and utilise its existing calculations. This architectural change ensures the drafters avoid operations that would “have to waste time recalculating context the larger model has already figured out.”
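The cache-sharing idea can be illustrated with a toy. In a real system the shared state is the target's per-layer attention key/value cache; here a simple memoising dictionary stands in for it (a hypothetical sketch, not Google's implementation), purely to show why the drafter never re-encodes context the target has already processed.

```python
class SharedCache:
    """Stand-in for a shared KV cache: memoises 'encoded' context prefixes."""

    def __init__(self):
        self.store = {}         # prefix (tuple of tokens) -> encoded state
        self.encode_calls = 0   # counts the expensive from-scratch encodings

    def encode(self, prefix):
        key = tuple(prefix)
        if key in self.store:
            return self.store[key]   # cache hit: no recomputation
        self.encode_calls += 1       # cache miss: the expensive path
        state = sum(key)             # stand-in for real attention state
        self.store[key] = state
        return state

def target_step(cache, prefix):
    """Target model consults the shared cache before predicting."""
    return cache.encode(prefix) % 10

def draft_step(cache, prefix):
    """Drafter reads the SAME cache, so context the target has already
    encoded costs it nothing."""
    return cache.encode(prefix) % 10
```

Running `target_step` and then `draft_step` on the same prefix leaves `encode_calls` at 1: the drafter's lookup reuses the target's work instead of recomputing it.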
The MTP drafters are available under the Apache 2.0 open-source licence. Engineers can download model weights from repositories such as Hugging Face and Kaggle. Open-source development platforms like SGLang have also announced day-zero support for the update.
The software integrates directly with frameworks such as MLX and Ollama, enabling developers to immediately deploy faster autonomous agents and real-time coding assistants across consumer hardware.
