Google has introduced Gemma 4 12B, a new open-weight multimodal model designed to run locally on consumer hardware while supporting text, image and audio inputs through a single unified architecture.
The model sits between Google’s smaller E4B model and its larger 26B Mixture-of-Experts (MoE) system, offering what the company describes as near-26B benchmark performance at less than half the memory footprint.
Google said Gemma 4 models have now surpassed 150 million downloads across the developer community.
Gemma 4 12B is the first mid-sized Gemma model to include native audio capabilities. According to Google, the model can run locally on laptops equipped with 16GB of VRAM or unified memory, targeting multimodal reasoning tasks, agentic workflows and offline AI applications.
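To put the 16GB figure in context, the back-of-envelope sketch below estimates weight storage alone for a 12-billion-parameter model at a few common precisions. The article does not say which precision Google's claim assumes, so the parameter count (read off the model name) and the precision table are illustrative assumptions. The estimate ignores the KV cache, activations and runtime overhead.

```python
# Rough weight-only memory estimate. All figures are assumptions for
# illustration; the article does not state the exact parameter count or
# the numeric precision behind Google's 16GB claim.

PARAMS = 12e9  # assumed from the "12B" in the model name

# Bytes needed to store one parameter at each (illustrative) precision.
BYTES_PER_PARAM = {
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def fits(params: float, bytes_per_param: float, budget_gb: float = 16.0) -> bool:
    """True if the weights alone fit within the memory budget."""
    return weight_memory_gb(params, bytes_per_param) <= budget_gb


for name, nbytes in BYTES_PER_PARAM.items():
    size = weight_memory_gb(PARAMS, nbytes)
    print(f"{name}: {size:.1f} GB, fits in 16 GB: {fits(PARAMS, nbytes)}")
```

Under these assumptions, 16-bit weights alone would need about 24 GB, while 8-bit and 4-bit weights would need about 12 GB and 6 GB respectively.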
A key technical change is the model’s encoder-free multimodal architecture. Most multimodal systems use dedicated vision and audio encoders that convert inputs into embeddings before passing them to the language model. Gemma 4 12B removes those components and processes visual and audio inputs directly within the model backbone.
For vision, Google replaced the traditional vision encoder with a lightweight embedding module consisting of a matrix multiplication layer, positional embeddings and normalisation steps. For audio, the company eliminated the audio encoder entirely, projecting raw audio signals into the same token space used by text.
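The vision path described above can be pictured as a short pipeline: cut the image into patches, project each patch with a single matrix multiplication, add a positional embedding, and normalise. The sketch below is a toy pure-Python illustration of that idea, not Google's implementation. The patch size, model width, random weights, the use of layer normalisation and the order of the steps are all assumptions, and every function name is hypothetical.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def project(vec, weight):
    """Matrix multiplication: vec (length n) times weight (n x d)."""
    d_model = len(weight[0])
    return [sum(vec[i] * weight[i][j] for i in range(len(vec)))
            for j in range(d_model)]


def layer_norm(vec, eps=1e-6):
    """Normalise a vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def embed_image(image, patch, proj, pos):
    """Turn an image into a list of token vectors, one per patch."""
    tokens = []
    for idx, flat_patch in enumerate(patchify(image, patch)):
        projected = project(flat_patch, proj)                    # matmul
        with_pos = [a + b for a, b in zip(projected, pos[idx])]  # position
        tokens.append(layer_norm(with_pos))                      # normalise
    return tokens


# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
PATCH, D_MODEL = 2, 8
image = [[rng.random() for _ in range(4)] for _ in range(4)]  # 4x4 grayscale
proj = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)]
        for _ in range(PATCH * PATCH)]
pos = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(4)]

tokens = embed_image(image, PATCH, proj, pos)
print(len(tokens), len(tokens[0]))  # 4 patches, each an 8-dimensional token
```

In this picture, the resulting vectors would be fed into the language model alongside text token embeddings, with no separate encoder in between.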
“What makes Gemma 4 12B stand out is its streamlined approach to processing visual and audio inputs,” Google said, adding that removing separate encoders reduces memory requirements and latency while simplifying deployment on local hardware.
The model also includes Multi-Token Prediction (MTP) drafters, which generate multiple future tokens simultaneously to reduce inference latency. Google is releasing Gemma 4 12B under the Apache 2.0 licence and making it available through tools including LM Studio, Ollama, Hugging Face Transformers, llama.cpp, MLX, SGLang and vLLM.
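The article does not describe how the MTP drafters are wired into inference. One common way to use a drafter is a draft-and-verify loop, often called speculative decoding: the drafter proposes several tokens, the main model checks them, and the longest agreeing prefix is kept. The toy sketch below illustrates that general pattern under stated assumptions. Both "models" are hypothetical stand-in functions, and a real system would verify all drafted positions in a single batched forward pass rather than one at a time.

```python
def target_next(context):
    """Hypothetical main model: deterministic next token for a context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next_k(context, k):
    """Hypothetical drafter: proposes k tokens, wrong at some positions."""
    ctx = list(context)
    proposal = []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:        # inject an occasional drafting error
            tok = (tok + 1) % 50
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def speculative_step(context, k):
    """Draft k tokens, then keep the prefix the main model agrees with."""
    accepted = []
    ctx = list(context)
    for tok in draft_next_k(context, k):
        expected = target_next(ctx)  # verification by the main model
        if tok != expected:
            accepted.append(expected)  # correct the first wrong token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one extra token
    return accepted


def generate(prompt, n_tokens, k=4):
    """Generate n_tokens; return the sequence and the verify-step count."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, k))
        steps += 1
    return out[:len(prompt) + n_tokens], steps


def greedy(prompt, n_tokens):
    """Baseline: one main-model call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


prompt = [1, 2, 3]
tokens, steps = generate(prompt, 12)
print(tokens == greedy(prompt, 12), steps)
```

The output matches plain token-by-token decoding, but with fewer verification steps than tokens generated, which is where the latency saving would come from in this pattern.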
Alongside the model release, Google is launching a new Gemma Skills Repository that contains reusable agent components for developers building applications on top of Gemma models. Production deployment options include Google Cloud’s Gemini Enterprise Agent Platform, Model Garden, Cloud Run and GKE.
The launch extends Google’s recent push into smaller and more efficient AI models.
In recent months, the company introduced Gemini 3.1 Flash-Lite, a model aimed at high-volume developer workloads with lower latency and cost, while continuing to expand the Gemini Flash family for production inference.
Gemma 4 12B also follows the broader Gemma 4 rollout, which introduced E4B and 26B variants focused on multimodal reasoning and agentic workflows, as Google increases its investment in models that can run directly on laptops, phones and other edge devices.
