Google has released a new software architecture for its Gemma 4 AI models, designed to speed up text generation by up to 300%.
The company launched Multi-Token Prediction (MTP) drafters to address the memory-bandwidth bottlenecks that typically slow processing.
The Gemma 4 open-weights model family, released last month, has recorded 60 million downloads in its first few weeks and gives developers openly licensed models for building applications.
Standard LLMs generate text one token at a time, which forces processors to repeatedly move billions of parameters from memory to compute units. Google’s update uses speculative decoding to decouple token generation from verification. The architecture pairs a heavy target model, such as the 31-billion-parameter Gemma 4, with a lightweight drafter.
This smaller drafter predicts multiple future tokens simultaneously using idle computational power. The larger target model then checks these suggestions in a single parallel pass.
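The draft-then-verify loop described above can be sketched in a few lines. The toy below uses deterministic stand-in functions (`target_next` and `draft_next` are hypothetical, not Google's implementation) to show the core pattern: the drafter proposes a block of tokens cheaply, the target checks the whole block in one pass, and the longest agreeing prefix is kept.

```python
def target_next(context):
    """Stand-in for the large target model: next token = sum of context mod 10."""
    return sum(context) % 10

def draft_next(context):
    """Stand-in for the lightweight drafter: usually agrees with the target,
    but is deliberately wrong when the context ends in 7, to force rejections."""
    if context and context[-1] == 7:
        return (sum(context) + 1) % 10  # deliberate disagreement
    return sum(context) % 10

def speculative_decode(context, n_tokens, k=4):
    """Generate n_tokens: the drafter proposes k tokens per round; the target
    verifies all k in a single pass and keeps the longest agreeing prefix."""
    out = list(context)
    generated = 0
    while generated < n_tokens:
        # 1. Drafter proposes k tokens autoregressively (the cheap step).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies the whole proposal in one "parallel" pass.
        accepted, ctx = 0, list(out)
        for t in proposal:
            if target_next(ctx) == t:
                accepted += 1
                ctx.append(t)
            else:
                break
        out.extend(proposal[:accepted])
        generated += accepted
        # 3. On a mismatch, fall back to the target's own token, which also
        #    guarantees progress each round.
        if accepted < k and generated < n_tokens:
            out.append(target_next(out))
            generated += 1
    return out[len(context):][:n_tokens]
```

Because the target model arbitrates every token, the output is identical to what target-only decoding would produce; the drafter only changes how many tokens each expensive target pass confirms.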
Google’s Product Management Director Olivier Lacombe and Developer Relations Engineer Maarten Grootendorst noted in a blog post that standard autoregressive generation devotes as much processing power to an obvious text continuation “as it does to solving a complex logic puzzle.”
The speed improvements apply across different hardware environments: on Apple Silicon chips, processing multiple requests simultaneously yields a 2.2x speed increase for the 26-billion-parameter mixture-of-experts model.
The update also optimises the Gemma 4 E2B and E4B edge models through clustering techniques, preserving battery life on mobile devices. Because the primary model handles final verification, developers maintain “identical frontier-class reasoning and accuracy, just delivered significantly faster.”
To achieve these results, Google engineered the draft models to share the target model’s cache and utilise its existing calculations. This architectural change ensures the drafters avoid operations that would “have to waste time recalculating context the larger model has already figured out.”
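The cache-sharing idea can be illustrated with a toy. In a real system the shared state is the target's per-layer attention key/value cache; here a simple memoising dictionary stands in for it (a hypothetical sketch, not Google's implementation), purely to show why the drafter never re-encodes context the target has already processed.

```python
class SharedCache:
    """Stand-in for a shared KV cache: memoises 'encoded' context prefixes."""

    def __init__(self):
        self.store = {}         # prefix (tuple of tokens) -> encoded state
        self.encode_calls = 0   # counts the expensive from-scratch encodings

    def encode(self, prefix):
        key = tuple(prefix)
        if key in self.store:
            return self.store[key]   # cache hit: no recomputation
        self.encode_calls += 1       # cache miss: the expensive path
        state = sum(key)             # stand-in for real attention state
        self.store[key] = state
        return state

def target_step(cache, prefix):
    """Target model consults the shared cache before predicting."""
    return cache.encode(prefix) % 10

def draft_step(cache, prefix):
    """Drafter reads the SAME cache, so context the target has already
    encoded costs it nothing."""
    return cache.encode(prefix) % 10
```

Running `target_step` and then `draft_step` on the same prefix leaves `encode_calls` at 1: the drafter's lookup reuses the target's work instead of recomputing it.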
The MTP drafters are available under the Apache 2.0 open-source licence. Engineers can download model weights from repositories such as Hugging Face and Kaggle. Open-source development platforms like SGLang have also announced day-zero support for the update.
The software integrates directly with frameworks such as MLX and Ollama, enabling developers to immediately deploy faster autonomous agents and real-time coding assistants across consumer hardware.
