Google launches encoder-free Gemma 4 12B AI model

Google released Gemma 4 12B on Tuesday, the first mid-sized open-weight model to handle text, images, and audio without dedicated encoders.
The model replaces conventional vision and audio encoder layers with lightweight projections, cutting latency and fitting within 16GB of VRAM.
Licensed under Apache 2.0, it launched with day-one support across llama.cpp, vLLM, MLX, Ollama, LM Studio, and Unsloth.

Google Launches Gemma 4 12B, an Encoder-Free Multimodal Model That Runs on Consumer Laptops

Google DeepMind released Gemma 4 12B on Tuesday, an open-source multimodal model that processes text, images, and audio without relying on dedicated encoders — a first for a mid-sized open-weight model. The 12-billion-parameter model fits within 16GB of VRAM or unified memory, putting multimodal AI inference on standard consumer hardware.

The release fills a gap in the Gemma 4 family, which launched in April with four variants ranging from edge-optimized E2B and E4B models to the larger 26B Mixture of Experts and 31B Dense configurations. While those earlier models relied on vision transformer layers and conformer-based audio encoders, the new 12B variant eliminates both in favor of what Google calls a “Unified” architecture.

How the Encoder-Free Design Works

In conventional multimodal models, separate encoder modules process images and audio before passing representations to the language model backbone. Gemma 4 12B replaces the entire vision encoder (typically 15 to 27 transformer layers) with a lightweight 35-million-parameter embedding module that projects raw pixel patches directly into the LLM’s token space using a single matrix multiplication and factorized 2D positional embeddings. Audio follows a similar path: raw 16 kHz waveforms are split into 40-millisecond frames and projected directly into the same dimensional space as text tokens, bypassing any separate speech recognition encoder.
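As a rough illustration of the idea described above, the following NumPy sketch projects flattened pixel patches into a token space with one matrix multiplication and adds factorized 2D positional embeddings (one table indexed by patch row, one by patch column). It also shows the frame arithmetic for audio: at 16 kHz, a 40-millisecond frame holds 640 samples. The patch size, token width, random weights, and function names are assumptions chosen for illustration; none of them come from Google's implementation.

```python
import numpy as np

PATCH = 16      # assumed patch edge in pixels (hypothetical)
D_MODEL = 64    # assumed token width, kept tiny for the demo (hypothetical)
MAX_GRID = 32   # assumed maximum number of patches per side (hypothetical)

rng = np.random.default_rng(0)
W_patch = rng.normal(0, 0.02, size=(PATCH * PATCH * 3, D_MODEL))  # stand-in projection weights
row_pos = rng.normal(0, 0.02, size=(MAX_GRID, D_MODEL))           # one vector per patch row
col_pos = rng.normal(0, 0.02, size=(MAX_GRID, D_MODEL))           # one vector per patch column


def embed_image(image):
    """Map an (H, W, 3) image to (num_patches, D_MODEL) token embeddings."""
    h, w, c = image.shape
    rows, cols = h // PATCH, w // PATCH
    # Cut the image into non-overlapping patches and flatten each one.
    patches = image[: rows * PATCH, : cols * PATCH].reshape(rows, PATCH, cols, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, PATCH * PATCH * c)
    tokens = patches @ W_patch  # the single matrix multiplication
    # Factorized 2D position: a row vector plus a column vector per patch.
    pos = row_pos[:rows, None, :] + col_pos[None, :cols, :]
    return tokens + pos.reshape(rows * cols, D_MODEL)


def frame_audio(waveform, sample_rate=16_000, frame_ms=40):
    """Split a 1-D waveform into fixed-length frames, dropping any remainder."""
    frame_len = sample_rate * frame_ms // 1000  # 640 samples at 16 kHz and 40 ms
    n_frames = len(waveform) // frame_len
    return waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
```

With these assumed sizes, a 64 by 48 pixel image becomes 12 tokens (a 4 by 3 grid of patches), and one second of 16 kHz audio becomes 25 frames of 640 samples each. In a full model, each audio frame would then be projected to the token width in the same way as the image patches.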

The practical result is lower latency, since the LLM can begin processing inputs without waiting for encoder pipelines to finish. It also simplifies fine-tuning: a single LoRA pass can update vision, audio, and text weights simultaneously.
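One way to read the fine-tuning claim: because image, audio, and text tokens all pass through the same backbone layers, a low-rank adapter attached to those layers is trained on every modality at once. The NumPy sketch below shows the standard LoRA formulation on a single linear layer. The sizes, rank, scaling factor, and variable names are illustrative assumptions, not values from Gemma 4 12B.

```python
import numpy as np

D_MODEL = 64   # assumed hidden width (hypothetical)
RANK = 4       # assumed LoRA rank (hypothetical)
ALPHA = 8.0    # assumed LoRA scaling factor (hypothetical)

rng = np.random.default_rng(0)
W_frozen = rng.normal(0, 0.02, size=(D_MODEL, D_MODEL))  # pretrained weights, not updated
A = rng.normal(0, 0.02, size=(RANK, D_MODEL))            # trainable down-projection
B = np.zeros((D_MODEL, RANK))                            # trainable up-projection, starts at zero


def lora_linear(x):
    """Apply the frozen layer plus its low-rank LoRA update to rows of tokens."""
    return x @ W_frozen.T + (ALPHA / RANK) * (x @ A.T) @ B.T


# Tokens from different modalities share the same layer, so one adapter covers them all.
text_tokens = rng.normal(size=(5, D_MODEL))
image_tokens = rng.normal(size=(12, D_MODEL))
mixed = np.concatenate([text_tokens, image_tokens])
out = lora_linear(mixed)
```

Because `B` starts at zero, the adapted layer initially matches the frozen one. Training then adjusts only `A` and `B` (512 values here, against 4,096 in the frozen matrix).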

Performance and Availability

Google says the 12B model approaches the performance of the larger 26B MoE variant on standard benchmarks at less than half the memory footprint. Reported scores include 77.2% on MMLU Pro and 78.8% on GPQA Diamond.
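The memory comparison can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage only, assuming every parameter is stored at the same bit width and ignoring the KV cache, activations, and runtime overhead. The function name and the choice of bit widths are illustrative, and the article does not say which precision the 16GB figure assumes.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes); excludes KV cache and activations."""
    return params_billions * bits_per_weight / 8


for bits in (16, 8, 4):
    small = weight_memory_gb(12, bits)
    large = weight_memory_gb(26, bits)
    print(f"{bits}-bit: 12B = {small:.1f} GB, 26B = {large:.1f} GB, ratio = {small / large:.2f}")
```

Under these assumptions the 12B weights take about 24 GB at 16-bit, 12 GB at 8-bit, and 6 GB at 4-bit, which suggests the 16GB figure relies on some form of quantization. At equal precision, the 12B model needs roughly 46% of the weight memory of a 26B model, in line with the "less than half" claim.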

The model is released under Apache 2.0, a commercially permissive open-source license that Google first adopted for the Gemma family with the April Gemma 4 release. Day-one support spans llama.cpp, vLLM, MLX, Ollama, LM Studio, and Unsloth.

Local Desktop Experience

The release coincides with continued expansion of Google’s local-first tooling for macOS. An open-source Electron application called Gemma Chat, which runs Gemma 4 models locally on Apple Silicon Macs through Apple’s MLX framework, supports the new 12B variant alongside earlier models. The app offers both a coding agent mode and a conversational mode with voice input powered by local speech-to-text, keeping all prompts and generated content on-device.