Summary

  • Google unveiled DiffusionGemma, a free open-weight AI model capable of generating complete 256-token blocks simultaneously through text diffusion, achieving speeds exceeding 1,000 tokens per second on an NVIDIA H100, which is four times quicker than typical autoregressive models.
  • However, the necessary custom drafter module for local inference is currently unavailable in any public runtime, including mlx-lm and LM Studio, rendering it impractical for most consumer systems at this time.
  • On NVIDIA NIM, the model is preconfigured with a context of 8,192 tokens, which is below the 64,000 tokens required by frameworks like Hermes Agent, meaning that autonomous workflows cannot be executed without manual adjustments.

Google has released DiffusionGemma, an open AI model that generates text similarly to how image generators create visuals: by starting with noise and refining it until clarity is achieved. It operates at a remarkable rate of 1,000 tokens per second on an NVIDIA H100. (Tokens represent the fundamental units of information processed by an AI model.) This makes it four times quicker than the standard Gemma model. Additionally, it is available for free under the Apache 2.0 license, with weights hosted on Hugging Face.

However, there are caveats. Throughput drops on consumer hardware: according to Google's announcement, the model achieves "700+ tokens per second on NVIDIA GeForce RTX 5090." It also does not match the output quality of standard Gemma 4.

Google acknowledges that this model prioritizes speed rather than quality improvements.

Understanding its Functionality

Most large language models (LLMs) operate like typewriters, generating one token at a time, with each token depending on those that came before it. This is how autoregressive architectures work.

DiffusionGemma, however, operates differently. Rather than producing tokens in sequence, it begins with disordered text segments and processes them in parallel. According to Google's developer guide, it "starts with a canvas of random placeholder tokens" and gradually solidifies confident tokens until the entire block is clearly defined. The model generates 256 tokens in each forward pass, keeping the GPU engaged throughout the process.
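The refinement loop described above can be sketched as a toy. This is not DiffusionGemma's code: the "model" here is faked with a known target sentence and random confidence scores, and the names `toy_denoise`, `MASK`, and `steps` are made up for illustration. It only shows the shape of the idea, a canvas of placeholders that is filled in over a few parallel passes rather than left to right.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token

def toy_denoise(target, steps=4, seed=0):
    """Fill an all-placeholder canvas over a few passes.

    `target` stands in for what a real model would predict. Confidence
    scores are random here; a real model would supply probabilities.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)          # start from placeholders only
    history = [list(canvas)]
    per_step = -(-len(target) // steps)    # ceiling division: slots committed per pass
    while MASK in canvas:
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        # Rank masked slots by (fake) confidence and commit the top ones.
        ranked = sorted(masked, key=lambda i: rng.random(), reverse=True)
        for i in ranked[:per_step]:
            canvas[i] = target[i]
        history.append(list(canvas))
    return history

# Print the canvas after each pass; slots fill in out of order.
for step, canvas in enumerate(toy_denoise("the cat sat on the mat".split(), steps=3)):
    print(step, " ".join(canvas))
```

Note that tokens are committed in order of confidence, not position, which is the main contrast with typewriter-style generation.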

This method allows for bidirectional attention: each token can attend to every other token in the block during generation, something autoregressive models cannot do because they never see future tokens. This capability enhances performance in scenarios where the conclusion of the output influences its beginning, such as code completion, structured output, and other constraint-heavy tasks. For demonstration purposes, Google fine-tuned a version to solve Sudoku puzzles: the base model scored around 0% accuracy, while the fine-tuned version reached 80%.
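The difference between the two attention patterns can be made concrete with a small sketch. The function names are illustrative, and the masks are plain Python lists rather than anything taken from the model's code. A `True` at row `i`, column `j` means position `i` may look at position `j`.

```python
def causal_mask(n):
    """Autoregressive pattern: position i sees only positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Diffusion-style pattern: every position sees the whole block."""
    return [[True] * n for _ in range(n)]

n = 4
for name, mask in (("causal", causal_mask(n)), ("bidirectional", bidirectional_mask(n))):
    print(name)
    for row in mask:
        print(" ".join("x" if visible else "." for visible in row))
```

In the causal case the first token sees only itself, so nothing written later can shape it. In the bidirectional case the first token can react to the last one, which is what helps on tasks like Sudoku where every cell constrains the others.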

Text diffusion has been an area of research for several years. Models like MDLM, SEDD, and LLaDA have shown the methodology's viability at smaller scales but largely remained as proofs of concept. In February 2026, Inception Labs launched Mercury 2, the first commercial diffusion reasoning model, claiming speeds five times that of competitors optimized for speed.

Nonetheless, none of these models were open-weight, nor did they offer immediate support in vLLM, Hugging Face Transformers, or Unsloth. DiffusionGemma marks the first significant open release from a leading lab.

Interestingly, there is a historical twist: image generators began as diffusion models (hence the term Stable Diffusion) and are now transitioning to autoregressive architectures for improved quality, while language models that started as autoregressive are now exploring diffusion techniques for enhanced speed.

Challenges in Running the Model… For Now

To run DiffusionGemma efficiently, a drafter is required—a lightweight module that suggests token blocks in parallel, later validated by the main model in a single forward pass. This process is known as speculative decoding. The framework DFlash, introduced in early 2026, utilizes a small diffusion model as its drafter, achieving over 6x speedup for specific tasks, making this class of model practical.
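A minimal sketch of the draft-then-verify idea follows, assuming greedy (exact-match) acceptance. It is not DFlash or any real runtime: `draft_block`, `target_next`, and the toy sentence are invented for illustration, and the verification loop stands in for what a real system would do in one batched forward pass.

```python
SENTENCE = "the quick brown fox jumps".split()  # toy "ground truth"

def target_next(ctx):
    """Stand-in for the main model: the correct next token for a context."""
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"

def draft_block(prefix, k):
    """Stand-in for the drafter: proposes k tokens at once, with one mistake."""
    block = SENTENCE[len(prefix):len(prefix) + k]
    return ["dog" if tok == "fox" else tok for tok in block]

def speculative_step(prefix, draft_block, target_next, k=4):
    """Accept drafted tokens until the first disagreement with the main model."""
    proposal = draft_block(prefix, k)
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # keep the main model's token, drop the rest
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

print(speculative_step(["the"], draft_block, target_next, k=4))
# → ['quick', 'brown', 'fox']
```

Here one verification round yields three tokens instead of one. The speedup depends on how often the drafter agrees with the main model.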

The issue is that DiffusionGemma requires a specific drafter to operate locally via MLX—Apple's machine learning framework for Apple Silicon. Currently, this necessary module is absent from any public version of mlx-lm, open pull requests, or bundled runtime in LM Studio.

An attempt to run DiffusionGemma with Hermes Agent through NVIDIA NIM failed. The model loaded, but the agent returned the error: "agent init failed: Model google/diffusiongemma-26b-a4b-it has a context window of 8,192 tokens, which is below the minimum 64,000 required by Hermes Agent."

To clarify, DiffusionGemma's actual context window is 256K tokens; the 8,192 figure comes from NVIDIA's default configuration, not from a limit of the model's architecture.

In practice, proper configuration for agentic use necessitates manual adjustments that many users have yet to master, and without these adjustments, Hermes Agent will not initialize. The benefit of parallel speed is negated if the agent cannot start.
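The failure mode is easy to picture as a preflight check. The sketch below is not Hermes Agent's code. `check_context` and `MIN_AGENT_CONTEXT` are hypothetical names, and 256K is written as 256_000 as a rounding assumption. It only mirrors the comparison implied by the error message: the configured context, not the architectural maximum, is what gets compared against the minimum.

```python
MIN_AGENT_CONTEXT = 64_000  # minimum quoted in the error message

def check_context(model_id, configured_context, minimum=MIN_AGENT_CONTEXT):
    """Raise if the endpoint's configured context is below the agent's minimum."""
    if configured_context < minimum:
        raise ValueError(
            f"Model {model_id} has a context window of {configured_context:,} "
            f"tokens, which is below the minimum {minimum:,} required"
        )
    return configured_context

MODEL = "google/diffusiongemma-26b-a4b-it"
for configured in (8_192, 256_000):  # endpoint default vs. the model's stated window
    try:
        check_context(MODEL, configured)
        print(configured, "ok")
    except ValueError as err:
        print(configured, "rejected:", err)
```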

It’s hoped that in the coming days, the community will produce improved resources to facilitate running these models.

Target Audience for This Model

This model is designed for developers equipped with NVIDIA RTX 4090 or 5090 hardware who are creating real-time tools such as inline editors, autocompletion, and structured generation. This aligns with Google's ongoing efforts to enhance local inference speed without the need for new hardware, as reported by Decrypt in May.

For researchers, the bidirectional generation capability opens new avenues that autoregressive models cannot access—such as protein sequences and mathematical graphs, where the current position is influenced by a distant one. This is a significant advancement.

Google released Gemma 4 under Apache 2.0 in April, and DiffusionGemma continues that open-source strategy. A draft pull request for llama.cpp was already opened today. Once the toolchain is fully developed, this technology will reach a broader audience.

When run on a capable discrete GPU, the performance of 1,000 tokens per second is indeed achievable.
