Google releases DiffusionGemma, an open AI model that generates text in parallel

  • Google DeepMind released DiffusionGemma, an open-source model under Apache 2.0 that uses diffusion-based decoding to generate blocks of text simultaneously.
  • The 26B Mixture of Experts model activates only 3.8B parameters during inference and fits within 18GB of VRAM on consumer GPUs when quantized.
  • Google says the model trails Gemma 4 on benchmark quality and targets researchers exploring speed-critical tasks such as code infilling and in-line editing.

Google DeepMind Launches DiffusionGemma, an Open Model That Generates Text in Parallel

Google DeepMind on Tuesday released DiffusionGemma, an experimental open-source language model that abandons the one-token-at-a-time approach of conventional AI systems in favor of generating entire blocks of text simultaneously. The company says this delivers up to four times faster inference on dedicated GPUs.

A New Architecture for Speed

Unlike standard autoregressive models that produce text sequentially, DiffusionGemma uses a diffusion-based decoder that drafts 256 tokens in parallel with each forward pass, shifting the inference bottleneck from memory bandwidth to raw compute. The model is a 26-billion-parameter Mixture of Experts design that activates only 3.8 billion parameters during inference, allowing it to fit within the 18GB of VRAM on high-end consumer GPUs when quantized.
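The memory and forward-pass figures above can be checked with simple arithmetic. The sketch below is illustrative only: the parameter counts, the 256-token block, and the 18GB budget come from the article, while the bit widths, function names, and the simplification that only weights occupy memory (no KV cache, activations, or runtime overhead) are assumptions. It also assumes all 26 billion parameters must stay resident even though only 3.8 billion are active per token.

```python
# Back-of-the-envelope check. Only the parameter counts, the 256-token block
# size, and the 18 GB budget come from the article; the rest is assumed.
TOTAL_PARAMS = 26e9      # total Mixture of Experts parameters
ACTIVE_PARAMS = 3.8e9    # parameters activated during inference
BLOCK_TOKENS = 256       # tokens drafted per forward pass
VRAM_BUDGET_GB = 18      # VRAM figure quoted for quantized deployment


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def forward_passes(num_tokens: int, tokens_per_pass: int) -> int:
    """Lower bound on forward passes needed to emit num_tokens.

    Ignores any extra refinement passes a diffusion decoder may need.
    """
    return -(-num_tokens // tokens_per_pass)  # ceiling division


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if gb <= VRAM_BUDGET_GB else "does not fit"
        print(f"{bits:>2}-bit weights: {gb:5.1f} GB ({verdict} in {VRAM_BUDGET_GB} GB)")

    active_share = ACTIVE_PARAMS / TOTAL_PARAMS
    print(f"Share of parameters active per token: {active_share:.1%}")

    tokens = 1024
    print(f"Sequential passes for {tokens} tokens: {forward_passes(tokens, 1)}")
    print(f"Block passes for {tokens} tokens:      {forward_passes(tokens, BLOCK_TOKENS)}")
```

Under these assumptions only the 4-bit case lands below 18GB, which is consistent with the article's "when quantized" qualifier.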

Google reports throughput of more than 1,000 tokens per second on a single Nvidia H100 accelerator and over 700 tokens per second on an Nvidia GeForce RTX 5090. The company framed those figures as roughly four times the output speed of comparably sized autoregressive Gemma models in single-user workloads.
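To put those throughput figures in user-facing terms, the following sketch converts them into the wall-clock time for a single response. The 1,000 and 700 tokens-per-second values are used as floors taken from the article; the 2,048-token response length is an arbitrary assumption, and the baseline is simply the quoted figure divided by four, not a measured number.

```python
# Throughput figures from the article, treated as lower bounds.
H100_TOKENS_PER_S = 1000.0
RTX5090_TOKENS_PER_S = 700.0
CLAIMED_SPEEDUP = 4.0        # "roughly four times" per the article
RESPONSE_TOKENS = 2048       # assumption: an arbitrary response length


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady tokens_per_second rate."""
    return num_tokens / tokens_per_second


def implied_baseline(tokens_per_second: float, speedup: float) -> float:
    """Autoregressive rate implied by dividing out the claimed speed-up."""
    return tokens_per_second / speedup


if __name__ == "__main__":
    for name, rate in (("H100", H100_TOKENS_PER_S), ("RTX 5090", RTX5090_TOKENS_PER_S)):
        baseline = implied_baseline(rate, CLAIMED_SPEEDUP)
        fast = generation_seconds(RESPONSE_TOKENS, rate)
        slow = generation_seconds(RESPONSE_TOKENS, baseline)
        print(f"{name}: {fast:.2f} s at {rate:.0f} tok/s "
              f"vs {slow:.2f} s at an implied {baseline:.0f} tok/s")
```

The article does not say whether the four-fold figure was measured on both GPUs, so the implied baselines should be read as rough illustrations only.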

Nvidia Optimization and Availability

Google said it worked directly with Nvidia to optimize DiffusionGemma across the chipmaker’s hardware lineup, from consumer GeForce RTX 4090 and 5090 GPUs to enterprise Hopper and Blackwell systems, including DGX Spark and DGX Station for local deployment. Native support for Nvidia’s NVFP4 four-bit floating-point format accelerates compute with what Google described as “near-lossless accuracy.”
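For readers unfamiliar with four-bit floating point, here is a simplified sketch of block-scaled quantization in the spirit of NVFP4. It is not Nvidia's implementation. The value grid is the standard set of magnitudes representable with one sign bit, two exponent bits, and one mantissa bit; the block size of 16 and the use of one scale per block are assumptions about the format's layout, and the scale is kept as an ordinary Python float here rather than a compact low-precision value.

```python
import math

# Magnitudes representable by a 4-bit float with 2 exponent bits and
# 1 mantissa bit (the fourth bit is the sign).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: number of values sharing one scale factor


def quantize_block(values):
    """Return (scale, codes): codes are signed grid values, one per input."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1]
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(target - g))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its grid codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    block = [0.02 * i - 0.15 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.5f}, worst absolute error={worst:.5f}")
```

Sharing one scale across a small block keeps outliers in one part of a tensor from wiping out precision elsewhere, which is the general idea behind claims of low accuracy loss at four bits.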

The model weights are available under an Apache 2.0 license on Hugging Face, and developers can also access DiffusionGemma through Nvidia NIM. Serving is supported via vLLM, MLX, and Hugging Face Transformers.

Trade-offs and Target Use Cases

Google was explicit that DiffusionGemma trails its standard Gemma 4 models on overall output quality, positioning the release as experimental and aimed at researchers exploring speed-critical workflows such as in-line editing, code infilling, and generating non-linear text structures. The parallel decoding advantage also diminishes under high-concurrency cloud serving, where autoregressive models can already saturate hardware through request batching.
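Why the advantage shrinks under batching can be illustrated with a toy roofline-style model, in which each forward pass takes as long as the slower of streaming the weights from memory or doing the arithmetic. Every number below is hypothetical: the hardware figures describe no real GPU, the model assumes the full 4-bit weight set is read on every pass, and `PASSES_PER_BLOCK` is an invented knob chosen only so that the single-user ratio lands near the article's four-fold figure. The output is not a prediction of real performance.

```python
# Toy roofline model. All constants are illustrative assumptions.
WEIGHT_BYTES = 13e9            # assumption: 26B parameters at 4 bits each
FLOPS_PER_TOKEN = 2 * 3.8e9    # common estimate: 2 FLOPs per active parameter
MEM_BANDWIDTH = 2e12           # hypothetical memory bandwidth, bytes/s
PEAK_FLOPS = 5e14              # hypothetical peak compute, FLOP/s
BLOCK_TOKENS = 256             # tokens drafted per pass (from the article)
PASSES_PER_BLOCK = 64          # hypothetical refinement passes per block


def step_seconds(tokens_in_pass: float) -> float:
    """One forward pass is bound by memory traffic or by arithmetic."""
    memory_time = WEIGHT_BYTES / MEM_BANDWIDTH
    compute_time = FLOPS_PER_TOKEN * tokens_in_pass / PEAK_FLOPS
    return max(memory_time, compute_time)


def autoregressive_tps(batch: int) -> float:
    """Each pass emits one token per batched request."""
    return batch / step_seconds(batch)


def diffusion_tps(users: int) -> float:
    """Each pass processes a full block per user; a block finishes
    only after PASSES_PER_BLOCK passes."""
    tokens_in_pass = BLOCK_TOKENS * users
    finished_per_pass = tokens_in_pass / PASSES_PER_BLOCK
    return finished_per_pass / step_seconds(tokens_in_pass)


def speedup(concurrency: int) -> float:
    """Diffusion throughput relative to batched autoregressive decoding."""
    return diffusion_tps(concurrency) / autoregressive_tps(concurrency)


if __name__ == "__main__":
    for c in (1, 8, 64):
        print(f"concurrency {c:>2}: diffusion/autoregressive = {speedup(c):.2f}x")
```

In this toy setup a single user leaves the arithmetic units mostly idle, so the extra work of parallel drafting is nearly free. Once batching fills the hardware for the autoregressive model as well, that spare compute is gone and the ratio falls.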

The release builds on Google’s broader diffusion-for-text research, which began publicly with the Gemini Diffusion demo in May 2025, and extends the Gemma 4 open model family that launched earlier this month.
