Summary
- Xiaomi, in collaboration with TileRT, has exceeded 1,000 tokens per second on a 1-trillion-parameter model running on a standard 8-GPU commodity node.
- The speed comes from FP4 quantization of the model's expert layers combined with DFlash speculative decoding, which lets the model confirm several tokens in a single step.
- A limited API trial will be available from June 9 to June 23, priced at three times the standard MiMo rates while offering nearly ten times the generation speed.
Best known as a budget smartphone manufacturer, Xiaomi has now set an AI inference speed record with its latest release.
The company has unveiled MiMo-V2.5-Pro-UltraSpeed, a version of its flagship model that can surpass 1,000 tokens per second, reaching nearly 1,200 in demonstrations.
In AI models, parameters are the internal numerical weights that determine a model's capabilities; the greater the number, the more intricate the patterns it can identify. Tokens represent segments of text the model processes, averaging about three-quarters of a word each.
The result was achieved on a single 8-GPU commodity node, using standard hardware rather than specialized chips, which makes these speeds far easier to deploy widely.
For context, according to Artificial Analysis, GPT-5.5, commonly used by ChatGPT, runs at around 68 tokens per second, and Claude Opus 4.6 at about 71. Haiku reaches approximately 98 tokens per second, and Gemini Flash 192. MiMo-V2.5-Pro-UltraSpeed exceeds 1,000 tokens per second while performing comparably to Opus in coding tasks.
Companies like Cerebras and Groq have built their businesses around faster inference. Cerebras developed a chip the size of a dinner plate with 44GB of on-chip memory, which lets it reach 969 tokens per second on Meta's Llama 3.1 405B, a model less than half the size of MiMo-V2.5-Pro. Groq's specialized Language Processing Unit architecture reaches between 300 and 750 tokens per second, depending on the model used.
However, neither of these solutions operates on hardware that can be easily rented from AWS.
Xiaomi's achievement relies solely on commodity GPUs and software innovations, specifically a combination of model optimizations and a specialized inference engine named TileRT.
Technical Innovations Behind the Speed
The speed comes from two primary techniques. The first is FP4 quantization, which stores the expert layers, the bulk of the trillion parameters, at 4-bit precision instead of the standard 8 or 16 bits. This cuts memory usage, eases bandwidth pressure, and increases speed. Because Xiaomi compresses only the expert layers and keeps the other components at full precision, the quality loss is negligible.
The second technique, DFlash speculative decoding, improves on traditional speculative decoding. In the traditional approach, a smaller draft model predicts a few tokens one after another, and the larger model then verifies them. DFlash skips the sequential drafting and instead fills an entire block of masked positions in one go. In coding tasks, the primary model accepts an average of 6.3 out of 8 proposed tokens in a single verification step.
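To make the memory argument concrete, here is a minimal Python sketch of 4-bit float quantization and the resulting storage arithmetic. It rests on assumptions the article does not confirm: the 4-bit grid is the common E2M1 layout, each block of weights shares one scale, and the memory figures treat all 1 trillion parameters as stored at a single precision, even though only the expert layers are compressed. The function names are illustrative and are not Xiaomi's or TileRT's API.

```python
# Representable magnitudes of a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
# Assumption: the article says "FP4" but does not name the exact format or scaling scheme.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_fp4(weights):
    """Snap one block of floats to the FP4 grid using a shared per-block scale."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0] * len(weights), 1.0
    scale = max_abs / FP4_MAGNITUDES[-1]  # largest weight maps onto 6.0
    quantized = []
    for w in weights:
        # Pick the grid magnitude closest to the scaled absolute value.
        magnitude = min(FP4_MAGNITUDES, key=lambda q: abs(abs(w) / scale - q))
        quantized.append(magnitude if w >= 0 else -magnitude)
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate weights from FP4 codes and the block scale."""
    return [q * scale for q in quantized]


def weight_memory_gb(num_params, bits_per_param):
    """Raw weight storage in gigabytes (10**9 bytes), ignoring scales and metadata."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    block = [0.12, -0.03, 0.48, -0.6, 0.3, 0.0, -0.21, 0.09]
    codes, scale = quantize_block_fp4(block)
    restored = dequantize_block(codes, scale)
    print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(1e12, bits):.0f} GB")
```

Under these simplifications, a trillion parameters take about 2,000 GB at 16 bits, 1,000 GB at 8 bits, and 500 GB at 4 bits, which is why dropping to 4 bits relieves both memory capacity and bandwidth.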
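The verification side of speculative decoding can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DFlash itself: `toy_target` is a hypothetical stand-in for the large model, the drafted block is simply passed in as a list, and verification is written as a loop even though a real engine scores every drafted position in one forward pass.

```python
def verify_block(prefix, draft_block, target_next_token):
    """Accept the longest run of drafted tokens that the target model agrees with.

    At the first disagreement, the target's own token is kept instead and the
    rest of the drafted block is discarded.
    """
    accepted = []
    for token in draft_block:
        expected = target_next_token(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # target's correction replaces the first miss
            break
        accepted.append(token)
    return accepted


def toy_target(context):
    """Hypothetical stand-in for the large model: always predicts last token + 1."""
    return context[-1] + 1


if __name__ == "__main__":
    prefix = [1, 2, 3]
    draft = [4, 5, 6, 9, 10, 11, 12, 13]  # block of 8 drafted tokens, wrong from the 4th
    print(verify_block(prefix, draft, toy_target))
```

In this toy run, three drafted tokens are accepted and the fourth is replaced by the target's correction, so one verification step yields four tokens. The article's figure of 6.3 accepted out of 8 corresponds to an acceptance rate of roughly 79% per block.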
TileRT integrates these processes, maintaining the entire computation pipeline within the GPU, eliminating delays from operator launches and ensuring continuous execution.
Xiaomi describes this method as "extreme model-system codesign," a fitting label given that neither technique alone reaches the 1,000 tokens per second mark, but their combined effect does.
MiMo-V2.5-Pro is itself a cutting-edge model. Its launch, covered in April, showed performance comparable to Claude Opus in coding tasks at approximately $0.43 per million input tokens and $0.87 per million output tokens, compared with $5 and $25 for Opus.
UltraSpeed enhances the original MiMo V2.5 Pro model without any reductions in capabilities.
Faster inference changes what the model can be used for, enabling simultaneous processing of multiple reasoning paths rather than waiting on a single sequential response. This matters for applications such as fraud detection, trading signal generation, and real-time agent loops, which need lower latency than 60 tokens per second can deliver. At 1,000 tokens per second, these demands can be met.
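A quick back-of-the-envelope calculation shows the latency gap. The Python sketch below uses a hypothetical 2,000-token response and counts generation time only, ignoring prompt processing and network overhead, which the article does not quantify.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to stream num_tokens at a steady decode rate, generation only."""
    return num_tokens / tokens_per_second


def responses_in_budget(budget_seconds, num_tokens, tokens_per_second):
    """How many full responses of num_tokens fit back to back in a time budget."""
    return int(budget_seconds // generation_seconds(num_tokens, tokens_per_second))


if __name__ == "__main__":
    response_tokens = 2_000  # hypothetical response length
    for rate in (60, 1_000):
        seconds = generation_seconds(response_tokens, rate)
        print(f"{rate} tokens/s: {seconds:.1f} s per response")
    print("responses in 30 s at 1,000 tokens/s:",
          responses_in_budget(30, response_tokens, 1_000))
```

At 60 tokens per second the hypothetical response takes about 33 seconds; at 1,000 tokens per second it takes 2 seconds, leaving room for many agent steps or candidate answers inside the same time budget.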
Xiaomi is offering this enhanced speed at three times the standard MiMo-V2.5-Pro rate, for approximately ten times the generation speed. The API trial runs from June 9 to June 23, with applications prioritized for enterprise and professional developers. The FP4-DFlash checkpoint has already been open-sourced on Hugging Face for community experimentation.
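For a rough sense of what the trial pricing means, the Python sketch below applies the threefold multiplier to the standard rates quoted earlier in the article. It assumes the multiplier applies equally to input and output tokens, which the article does not state explicitly, and the job size is hypothetical.

```python
from decimal import Decimal

# USD per million tokens, as quoted in the article for standard MiMo-V2.5-Pro.
STANDARD_RATES = {"input": Decimal("0.43"), "output": Decimal("0.87")}
# Assumption: the 3x trial multiplier applies uniformly to input and output.
TRIAL_MULTIPLIER = Decimal("3")


def trial_rates(standard, multiplier):
    """Scale every per-million-token rate by the trial multiplier."""
    return {kind: price * multiplier for kind, price in standard.items()}


def job_cost(rates, input_tokens, output_tokens):
    """Cost in USD of one job at the given per-million-token rates."""
    million = Decimal(1_000_000)
    return (rates["input"] * input_tokens + rates["output"] * output_tokens) / million


if __name__ == "__main__":
    rates = trial_rates(STANDARD_RATES, TRIAL_MULTIPLIER)
    print("trial rates per million tokens:", rates)
    # Hypothetical job: one million input tokens and one million output tokens.
    print("standard cost:", job_cost(STANDARD_RATES, 1_000_000, 1_000_000))
    print("trial cost:", job_cost(rates, 1_000_000, 1_000_000))
```

Under that assumption the trial rates come to $1.29 per million input tokens and $2.61 per million output tokens, still below the $5 and $25 the article quotes for Opus.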
