Summary

  • Mercury 2 from Inception Labs generates around 1,000 tokens per second and scores 90% on AIME 2026.
  • Google's DiffusionGemma reaches similar speeds but scores lower on benchmarks.
  • DiffusionGemma is free and open-weight on Hugging Face, whereas Mercury 2 is a paid, closed-weight API model.

Inception Labs unveiled Mercury 2 on Thursday, calling it the world's fastest reasoning language model. According to the company's announcement, it produces approximately 1,000 tokens per second, roughly 11 times the rate of Anthropic’s Claude Haiku 4.5 Reasoning (about 89 tokens per second) and 14 times that of OpenAI’s GPT-5 Mini (71 tokens per second).
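To put those throughput figures in perspective, here is a minimal Python sketch that converts them into speed ratios and wait times. The tokens-per-second values are the ones quoted in the announcement; the 2,000-token response length and the helper names are assumptions chosen purely for illustration.

```python
# Throughput figures as quoted in the announcement (tokens per second).
THROUGHPUT_TPS = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady generation rate."""
    return num_tokens / tokens_per_second


def speedup_over(baseline: str, model: str = "Mercury 2") -> float:
    """How many times faster `model` generates than `baseline`."""
    return THROUGHPUT_TPS[model] / THROUGHPUT_TPS[baseline]


RESPONSE_TOKENS = 2000  # hypothetical response length, not from the article

for name, tps in THROUGHPUT_TPS.items():
    print(f"{name}: {seconds_to_generate(RESPONSE_TOKENS, tps):.1f} s")
```

At these rates, a response that takes Mercury 2 about two seconds would take the other two models more than twenty. The sketch ignores network latency and time to first token, which also affect real-world responsiveness.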

That figure matches the speed Google claims for its DiffusionGemma.

Welcome to the diffusion era.

We embraced parallel generation years back when it was considered unconventional. It’s exciting to see the industry evolve.

Mercury 2 leads the Pareto frontier for quality, speed, and cost among publicly available diffusion LLMs. pic.twitter.com/qSHuiR7vmH

— Inception (@_inception_ai) June 18, 2026

Both models write text differently from conventional chatbots. Traditional chatbots generate one token (roughly a word or word fragment) at a time, and each new token must wait for the one before it, which limits speed. Diffusion models instead fill a text block with random placeholder tokens and refine the whole block over multiple parallel passes, similar to how image generators like Stable Diffusion create images from static noise.
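The difference in the number of passes can be shown with a toy Python sketch. This is not how Mercury 2 or DiffusionGemma is implemented: there is no neural network here, the "model" simply copies from a known target sentence, and the function names, the `<mask>` placeholder, and the three-pass schedule are all assumptions made for illustration.

```python
import random

TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "<mask>"


def autoregressive_decode(target):
    """One pass per token: each token is appended after the previous ones."""
    output, passes = [], 0
    for token in target:
        output.append(token)  # stand-in for a model call conditioned on `output`
        passes += 1
    return output, passes


def diffusion_style_decode(target, num_passes=3, seed=0):
    """Start from an all-placeholder block and fill several positions per pass."""
    rng = random.Random(seed)
    output = [MASK] * len(target)
    hidden = list(range(len(target)))
    rng.shuffle(hidden)  # positions are resolved in no particular order
    for step in range(num_passes):
        remaining_passes = num_passes - step
        # Reveal an even share of the still-masked positions in this pass.
        count = -(-len(hidden) // remaining_passes)  # ceiling division
        for pos in hidden[:count]:
            output[pos] = target[pos]  # stand-in for a parallel prediction
        hidden = hidden[count:]
    return output, num_passes


ar_text, ar_passes = autoregressive_decode(TARGET)
diff_text, diff_passes = diffusion_style_decode(TARGET)
print(f"autoregressive: {ar_passes} passes, diffusion-style: {diff_passes} passes")
```

Both functions end with the same nine-word sentence, but the sequential version needs one pass per token while the parallel version needs a fixed number of passes regardless of length. That gap is the source of the speed advantage, provided each parallel pass is not much more expensive than a sequential one.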

Where the two models differ is output quality. On AIME 2026, a benchmark built from real American Invitational Mathematics Examination problems and scored as the percentage of correct answers, Mercury 2 scored 90%. DiffusionGemma scored 69.1% on the same test, while the standard non-diffusion Gemma 4 scored 88.3%.

On GPQA, a PhD-level science benchmark, the gap was much narrower, with Mercury 2 at 77% and DiffusionGemma at 73.2%. Even so, Google's developer guide recommends the standard Gemma 4 for applications that require the highest quality, an acknowledgment that DiffusionGemma falls short.

The speed advantage is also evident in practical applications. Augment Code, an AI coding-agent firm, replaced Anthropic's Claude Opus 4 with Mercury 2 for its context-compaction subagent, resulting in an 82% reduction in latency and a 90% decrease in costs while maintaining output quality, as shown in a joint case study.
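Percentage reductions can understate the change, so the short Python sketch below converts the case-study figures into multipliers. The 82% and 90% values come from the case study as reported above; the helper name is an assumption for illustration.

```python
def reduction_to_factor(percent_reduction: float) -> float:
    """Convert 'X% lower' into 'N times smaller' (e.g. 50% lower -> 2x)."""
    return 1.0 / (1.0 - percent_reduction / 100.0)


latency_factor = reduction_to_factor(82)  # latency reported as 82% lower
cost_factor = reduction_to_factor(90)     # cost reported as 90% lower

print(f"latency: about {latency_factor:.1f}x lower")
print(f"cost: about {cost_factor:.1f}x lower")
```

In other words, an 82% latency reduction means each call finishes in under a fifth of the original time, and a 90% cost reduction means roughly ten calls for the price of one.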

Inception Labs grew out of research by Stefano Ermon, a Stanford professor who co-developed the score-based diffusion techniques underlying modern image generators. The startup recently secured $50 million in funding, with backing from Nvidia's venture arm and investors including Andrew Ng and Andrej Karpathy.

For non-technical users, the main benefit is a smoother "flow" experience. In long sessions, traditional models often leave users waiting between steps. Diffusion models respond quickly enough to allow instant autocompletion, quick iterations on code or plans, and sub-agents that handle repetitive tasks without slowing the overall session.

The rise of sub-agent architectures marks a notable shift. Rather than relying on a single, complex AI, systems now consist of multiple specialized helpers: one for deep reasoning, others for quick summarization, routing, tool lookup, and output verification. Sequential models make these utility calls costly and slow, while parallel diffusion models make them cheap and fast, so they can be used more often.
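A sub-agent setup of this kind can be sketched in a few lines of Python. Everything here is hypothetical: the agent names, the routing table, and the stub handlers stand in for real model calls and do not reflect any vendor's API or Augment Code's actual system.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SubAgent:
    name: str
    handler: Callable[[str], str]


def make_stub(label: str) -> Callable[[str], str]:
    # Stand-in for a real model call; it only tags the input with a label.
    return lambda text: f"[{label}] {text}"


# Hypothetical split: one slower "deep" model, one fast model for utility work.
DEEP = SubAgent("deep-reasoner", make_stub("deep"))
FAST = SubAgent("fast-utility", make_stub("fast"))

ROUTES: Dict[str, SubAgent] = {
    "reasoning": DEEP,
    "summarization": FAST,
    "routing": FAST,
    "tool_lookup": FAST,
    "verification": FAST,
}


def dispatch(task_kind: str, payload: str) -> str:
    """Send a task to the sub-agent registered for its kind."""
    agent = ROUTES.get(task_kind, DEEP)  # unknown work falls back to the deep model
    return agent.handler(payload)


print(dispatch("summarization", "long chat log"))
print(dispatch("reasoning", "plan the refactor"))
```

The design choice worth noting is the fallback: anything the router does not recognize goes to the slower, stronger model, so the fast model only handles work it was explicitly assigned.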

These models are, however, best suited to speed-sensitive, high-volume tasks rather than the hardest reasoning problems, where larger autoregressive (AR) models may still have an advantage. Mercury 2 is a paid API/cloud model, unlike Google's free, open-weight DiffusionGemma, and the broader ecosystem (including local runtimes and agent frameworks) is still maturing, so integration is not yet seamless.

Immediate applications include real-time programming and "vibe coding" where the model adapts to user edits, multi-agent coding systems that make many rapid sub-calls, responsive voice interfaces, and any latency-sensitive task that depends on autocomplete or next-action prediction. At scale, the efficiency and energy savings from higher output on standard hardware can be substantial.

The data shared by Inception, along with independent evaluations, support this picture: Mercury 2 sits in the "fast and good" quadrant among diffusion models, making workloads that previously demanded high-end hardware feasible on commodity GPUs.
