Summary
- DeepSeek has made its 75% discount on the V4-Pro model permanent as of May 22, bringing the output price down to $0.87 per million tokens.
- Xiaomi reduced prices for its MiMo-V2.5 API by as much as 99% on May 26, with the Pro model's cached input priced at $0.0036 per million tokens.
- By contrast, OpenAI's GPT-5.5 launched with output pricing doubled to $30 per million tokens, while Anthropic's Claude Opus 4.7 introduced a new tokenizer that could raise costs by up to 35%.
DeepSeek has made its 75% price cut on DeepSeek V4-Pro permanent, a move aimed at making AI services more affordable. At the same time, Xiaomi has sharply cut the prices of its MiMo-V2.5 API, while American companies have moved in the opposite direction and raised prices.
For those unfamiliar with the technical side: using AI models like ChatGPT or Claude in a browser usually costs a flat subscription fee or nothing at all. Companies building applications on top of these models, however, pay per token, where a token is roughly three-quarters of a word. Every prompt, response, and processed document adds to the token count, and prices are quoted per million tokens, with separate rates for input and output.
An API is the interface that lets other applications call these AI models. As a result, token pricing largely determines whether an AI-powered product is financially sustainable or a drain on resources.
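To make the billing model concrete, here is a minimal Python sketch of how a per-request cost follows from per-million-token prices. The function names and the example document size are hypothetical; the rates are DeepSeek V4-Pro's list prices quoted later in this article, and the 0.75-words-per-token ratio is the rough rule of thumb above, not an exact tokenizer count.

```python
# Minimal sketch of per-million-token API billing.
# Function names and the example workload are hypothetical.


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given prices quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def words_to_tokens(words: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate using the ~3/4-word-per-token rule of thumb."""
    return round(words / words_per_token)


if __name__ == "__main__":
    # A 1,500-word document in, a 300-word summary out, at
    # DeepSeek V4-Pro's list rates ($0.435 in / $0.87 out per million tokens).
    tokens_in = words_to_tokens(1_500)   # 2,000 tokens
    tokens_out = words_to_tokens(300)    # 400 tokens
    cost = request_cost_usd(tokens_in, tokens_out, 0.435, 0.87)
    print(f"{tokens_in} in, {tokens_out} out -> ${cost:.6f}")
```

Real applications send thousands or millions of such requests, which is why a fraction of a cent per call still adds up.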
Token plans work like a prepaid subscription: credits are purchased in advance and drawn down as the model processes tokens. Xiaomi's billing changes give users 5 to 8 times more tokens for the same price. The $100 Max plan, for instance, now includes 82 billion tokens, up from 1.6 billion.
At roughly three-quarters of a word per token, 82 billion tokens equate to about 61.5 billion words.
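A prepaid token plan can be modeled as a balance that requests draw down. The sketch below is hypothetical (the class and method names are illustrative, not Xiaomi's billing API) and uses the Max plan figures above; it also reproduces the words estimate with the same 0.75 ratio.

```python
# Hypothetical model of a prepaid token plan; not Xiaomi's actual billing API.
from dataclasses import dataclass

WORDS_PER_TOKEN = 0.75  # rough rule of thumb, not an exact tokenizer ratio


@dataclass
class TokenPlan:
    price_usd: float
    token_allowance: int
    used: int = 0

    def remaining(self) -> int:
        """Tokens left in the prepaid balance."""
        return self.token_allowance - self.used

    def consume(self, tokens: int) -> None:
        """Deduct tokens, refusing any request that exceeds the balance."""
        if tokens > self.remaining():
            raise ValueError("insufficient token balance")
        self.used += tokens

    def usd_per_million(self) -> float:
        """Effective price per million tokens if the whole allowance is used."""
        return self.price_usd / (self.token_allowance / 1_000_000)


if __name__ == "__main__":
    max_plan = TokenPlan(price_usd=100.0, token_allowance=82_000_000_000)
    max_plan.consume(2_000)                            # one mid-sized request
    print(max_plan.remaining())                        # 81999998000
    print(max_plan.token_allowance * WORDS_PER_TOKEN)  # 61500000000.0 words
    print(f"${max_plan.usd_per_million():.5f} per million tokens")
```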
Understanding the rationale behind these substantial cuts
Fuli Luo, who leads Xiaomi's MiMo team and was previously a core developer at DeepSeek, explained the cuts on X. According to Luo, most of the savings come from a more efficient way of storing and reusing the intermediate data (the KV cache) that the model has already computed for previously processed text. Xiaomi's system can now retain about five times more of this cached data, which sharply reduces the compute required and cuts storage and processing costs by 80%.
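The general idea of keeping more reusable cache around can be illustrated with a two-tier cache: a small, fast tier standing in for GPU memory and a larger, slower tier standing in for host memory or SSD. This is a generic toy under those assumptions, not Xiaomi's actual design, which the post describes only at a high level. An entry evicted from the fast tier spills down instead of being thrown away, so a later request with the same prefix is still a cache hit rather than a full recompute.

```python
# Toy two-tier (hierarchical) LRU cache keyed by prompt prefix.
# Tier names and sizes are illustrative assumptions, not Xiaomi's design.
from collections import OrderedDict
from typing import Any, Optional, Tuple


class TieredKVCache:
    def __init__(self, fast_capacity: int, slow_capacity: int) -> None:
        self.fast: "OrderedDict[str, Any]" = OrderedDict()  # stands in for GPU memory
        self.slow: "OrderedDict[str, Any]" = OrderedDict()  # stands in for host RAM / SSD
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def put(self, key: str, kv: Any) -> None:
        """Insert into the fast tier; least-recently-used entries spill to the slow tier."""
        self.slow.pop(key, None)  # never keep two copies of the same entry
        self.fast[key] = kv
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_kv = self.fast.popitem(last=False)
            self.slow[old_key] = old_kv
            while len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)  # evicted from both tiers: future miss

    def get(self, key: str) -> Optional[Tuple[Any, str]]:
        """Return (kv, tier) on a hit, promoting slow-tier hits; None on a miss."""
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast"
        if key in self.slow:
            kv = self.slow.pop(key)
            self.put(key, kv)  # promote back into the fast tier
            return kv, "slow"
        return None


if __name__ == "__main__":
    cache = TieredKVCache(fast_capacity=2, slow_capacity=4)
    for prefix in ("prompt-A", "prompt-B", "prompt-C"):
        cache.put(prefix, f"kv({prefix})")
    print(cache.get("prompt-A"))  # ('kv(prompt-A)', 'slow'): spilled, but still a hit
    print(cache.get("prompt-Z"))  # None: never cached, so a full recompute
```

The larger the slow tier, the more requests land as cache hits, which is the lever that makes a steep cache-hit discount affordable.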
Luo's post on X read:
"Justifying the MiMo API Price Cuts: Our deepest price reduction, reaching up to 99%, applies specifically to Input (Cache Hit). The key reason for this is our inference framework's support for hierarchical KV cache optimization for SWA. Tests of our production inference engine reveal that this optimization enhances cached token efficiency…" (Fuli Luo, @_LuoFuli, May 27, 2026)
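SWA stands for sliding window attention, in which a layer attends only to the most recent tokens within a fixed window. One reason it pairs well with caching is that an SWA layer's KV cache stops growing once the context exceeds the window. The back-of-envelope calculation below shows that effect; every model dimension in it is a made-up assumption, not MiMo-V2.5's configuration, and it does not model the hierarchical part of Xiaomi's optimization.

```python
# Back-of-envelope KV cache size: full attention vs. sliding-window attention (SWA).
# All model dimensions are hypothetical, chosen only to show the scaling.
from typing import Optional


def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   window: Optional[int] = None) -> int:
    """Bytes needed to cache keys and values for one sequence.

    Full attention (window=None) caches every token in the context; an SWA
    layer only needs the most recent `window` tokens.
    """
    cached_tokens = context_len if window is None else min(context_len, window)
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * cached_tokens


if __name__ == "__main__":
    ctx = 131_072  # 128k-token context
    gib = 2 ** 30
    # Hypothetical 48-layer model, 8 KV heads of dimension 128, 16-bit values.
    all_full = kv_cache_bytes(ctx, 48, 8, 128)
    # Same model with 40 SWA layers (4,096-token window) and 8 full-attention layers.
    hybrid = (kv_cache_bytes(ctx, 8, 8, 128)
              + kv_cache_bytes(ctx, 40, 8, 128, window=4_096))
    print(f"full attention: {all_full / gib:.3f} GiB")  # 24.000 GiB
    print(f"hybrid SWA:     {hybrid / gib:.3f} GiB")    # 4.625 GiB
```

In this hypothetical setup the hybrid cache is about a fifth of the full-attention cache, leaving room to keep many more sequences' caches resident.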
DeepSeek's architecture reaches similar results through different mechanisms. V4 interleaves two types of attention: one compresses every four tokens into a single entry and attends to those entries selectively, while the other compresses every 128 tokens to cover broader context at minimal compute. At a one-million-token context, V4-Pro's KV cache is only 10% the size of its predecessor's, and generating a single token takes just 27% of the previous compute.
This leads to a model that is 98% less expensive than GPT-5.5 Pro, while still delivering competitive performance.
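The compression idea behind V4's attention can be illustrated with a toy: pool every few token vectors into one entry, so a layer keeps far fewer entries. Mean pooling here is a simplifying stand-in chosen for readability; DeepSeek's real compression is learned and more involved, and entry counts alone do not reproduce the 10% and 27% figures, which depend on the full architecture.

```python
# Toy illustration of attending over compressed KV entries.
# Mean pooling is a stand-in; DeepSeek's actual compression is learned.
import math
from typing import List

Vector = List[float]


def compress(tokens: List[Vector], ratio: int) -> List[Vector]:
    """Mean-pool each block of `ratio` consecutive vectors (the last block may be short)."""
    blocks = [tokens[i:i + ratio] for i in range(0, len(tokens), ratio)]
    return [[sum(col) / len(block) for col in zip(*block)] for block in blocks]


def compressed_entries(context_len: int, ratio: int) -> int:
    """Number of KV entries a layer keeps after compressing by `ratio`."""
    return math.ceil(context_len / ratio)


if __name__ == "__main__":
    toy = [[float(i), float(-i)] for i in range(10)]
    print(compress(toy, 4))  # [[1.5, -1.5], [5.5, -5.5], [8.5, -8.5]]
    for ratio in (1, 4, 128):
        # At a one-million-token context: 1000000, 250000, 7813 entries.
        print(ratio, compressed_entries(1_000_000, ratio))
```

Interleaving a fine 4:1 layer with a coarse 128:1 layer trades detail for reach: the coarse layer sees the whole million-token context through fewer than 8,000 entries.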
Silicon Valley's pricing strategy
Claude Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens. Anthropic kept the same rates as before, but the new tokenizer can produce up to 35% more tokens from identical text, so the effective cost of a given workload can rise even though the rates are unchanged.
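The effect of the tokenizer change can be estimated by scaling token counts while holding rates fixed. This sketch assumes, for simplicity, that the full 35% increase applies to both input and output; in practice the increase varies with the text, and the workload size is hypothetical.

```python
# Effective cost when a new tokenizer yields more tokens for identical text.
# Assumes the same inflation factor for input and output (a simplification).


def workload_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float,
                      token_inflation: float = 1.0) -> float:
    """Cost after scaling both token counts by `token_inflation`."""
    scaled_in = input_tokens * token_inflation
    scaled_out = output_tokens * token_inflation
    return (scaled_in * input_price_per_m
            + scaled_out * output_price_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload measured with the old tokenizer: 1M input, 200k output tokens.
    before = workload_cost_usd(1_000_000, 200_000, 5.0, 25.0)
    after = workload_cost_usd(1_000_000, 200_000, 5.0, 25.0, token_inflation=1.35)
    print(f"${before:.2f} -> ${after:.2f}")  # $10.00 -> $13.50
```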
Meanwhile, GPT-5.5, launched in late April, doubled output pricing to $30 per million tokens. Gemini 2.5 Pro costs $1.25 per million input tokens and $10 per million output tokens, which is relatively inexpensive by American standards.
DeepSeek V4-Pro, a 1.6-trillion-parameter model that offers broad knowledge at a much lower compute cost, is now permanently priced at $0.435 per million input tokens and $0.87 per million output tokens. It scored 80.6% on SWE-bench Verified, a benchmark that measures resolution of real GitHub issues rather than cherry-picked showcases, just behind Claude Opus 4.6's 80.8%. Among models with nearly identical coding scores, output prices differ by as much as 34 times.
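A quick arithmetic check using the output prices quoted in this article shows where that multiple comes from: V4-Pro's output tokens are about 28.7 times cheaper than Claude Opus 4.7's $25 rate and about 34.5 times cheaper than GPT-5.5's $30 rate. These are list prices only and say nothing about quality.

```python
# Output-price ratios from the list prices quoted in this article
# (USD per million output tokens). Pure arithmetic, no quality weighting.
OUTPUT_PRICE_PER_M = {
    "DeepSeek V4-Pro": 0.87,
    "Claude Opus 4.7": 25.0,
    "GPT-5.5": 30.0,
}


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the `expensive` model charges per output token."""
    return OUTPUT_PRICE_PER_M[expensive] / OUTPUT_PRICE_PER_M[cheap]


if __name__ == "__main__":
    for model in ("Claude Opus 4.7", "GPT-5.5"):
        print(f"{model}: {price_ratio(model, 'DeepSeek V4-Pro'):.1f}x")
    # Claude Opus 4.7: 28.7x
    # GPT-5.5: 34.5x
```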
After the cuts, MiMo-V2.5-Pro matches DeepSeek V4-Pro's pricing of $0.435/$0.87 per million input/output tokens, with cache hits at $0.0036 per million tokens. That works out to $0.0000000036 per cached token, less than most people pay per character in a text message.
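Because cache hits are billed so far below misses, a workload's effective input price depends mostly on its cache hit rate. The sketch below treats the $0.435 input rate as the cache-miss price (an assumption about how the two quoted rates relate) and uses hypothetical hit rates.

```python
# Effective input price for a given cache hit rate.
# Assumes $0.435 is the cache-miss input rate and $0.0036 the cache-hit rate
# (MiMo-V2.5-Pro figures, USD per million tokens); hit rates are hypothetical.


def effective_input_price(hit_rate: float, hit_price: float, miss_price: float) -> float:
    """Weighted average price per million input tokens."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.9, 0.99):
        price = effective_input_price(rate, hit_price=0.0036, miss_price=0.435)
        print(f"hit rate {rate:.0%}: ${price:.5f} per million input tokens")
```

At a 90% hit rate, for example, the effective input price falls to about $0.047 per million tokens, roughly a tenth of the uncached rate.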
DeepSeek and Xiaomi are not isolated in this trend
Chinese models were already far cheaper before these announcements. MiniMax M2.7, which competes effectively with Claude Opus on coding benchmarks according to Artificial Analysis, costs $0.30 per million input tokens and $1.20 per million output tokens, roughly 5% of Opus 4.7's output price.
Moonshot AI's Kimi K2.5 scores 76.8% on SWE-bench Verified and costs $0.60 per million input tokens and $2.50 per million output tokens. Z.AI's GLM-5.1 recently outperformed Claude Opus 4.6 on a major coding benchmark. Four Chinese frontier models launched within 12 days in early May, all priced at less than one-third of Opus 4.7's per-token cost.
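Price-to-quality comparisons usually reduce separate input and output rates to one blended number. The sketch below assumes a 3:1 input-to-output token mix, a common convention for such charts, though the exact method behind any particular chart is not stated here. It uses only prices quoted in this article and omits GPT-5.5, whose input price is not given.

```python
# Blended price per million tokens at an assumed 3:1 input-to-output mix.
# Prices are the list rates quoted in this article (USD per million tokens).
PRICES = {  # model: (input, output)
    "Claude Opus 4.7": (5.00, 25.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Kimi K2.5": (0.60, 2.50),
    "DeepSeek V4-Pro": (0.435, 0.87),
    "MiniMax M2.7": (0.30, 1.20),
}


def blended_price(input_price: float, output_price: float,
                  input_share: int = 3, output_share: int = 1) -> float:
    """Weighted average of input and output prices for the given token mix."""
    total = input_share + output_share
    return (input_share * input_price + output_share * output_price) / total


if __name__ == "__main__":
    baseline = blended_price(*PRICES["Claude Opus 4.7"])
    for name, (inp, out) in sorted(PRICES.items(), key=lambda kv: blended_price(*kv[1])):
        b = blended_price(inp, out)
        print(f"{name:16} ${b:.3f}/M  ({baseline / b:.1f}x cheaper than Opus 4.7)")
```

Under these assumptions DeepSeek V4-Pro blends to about $0.54 per million tokens against $10 for Claude Opus 4.7, roughly 18 times lower.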
The following chart compares the price-to-quality ratio of Chinese models with that of the three leading American AI providers (Anthropic, OpenAI, and Meta).
Image: Artificialanalysis.ai
In Q2 2026, the price gap between Chinese and American frontier models ranges from 15 to 30 times, depending on which models are compared, and that is only the baseline, before any cache discounts.
This week's price changes widen that gap further for workloads that hit the cache heavily in production, such as agent pipelines with fixed system prompts, document processing, and retrieval tools. At $0.003625 per million cached input tokens, DeepSeek V4-Pro's cost for reusing context is practically negligible.
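Prefix caching is what makes those workloads cheap: when a request begins with tokens the provider has already processed, that prefix is billed at the cache-hit rate. The toy simulation below bills an agent pipeline that resends one fixed system prompt on every call. The 64-token block size, the prompt sizes, and the hash-chained block keys are assumptions for illustration, not DeepSeek's documented implementation; the rates are the V4-Pro cache-hit price above and its $0.435 input price, treated here as the miss price.

```python
# Toy prefix-cache billing for an agent pipeline with a fixed system prompt.
# Block size, workload, and hashing scheme are illustrative assumptions.
from typing import List, Optional, Set, Tuple

BLOCK = 64  # assumed cache granularity in tokens


def bill_prompt(tokens: List[int], seen: Set[int]) -> Tuple[int, int]:
    """Return (hit_tokens, miss_tokens) for one prompt and record its blocks.

    Each block's key hashes its parent's key plus its own tokens, so a block
    can only hit if the entire prefix up to and including it was seen before.
    """
    hits = misses = 0
    parent: Optional[int] = None
    full = len(tokens) // BLOCK * BLOCK
    for start in range(0, full, BLOCK):
        key = hash((parent, tuple(tokens[start:start + BLOCK])))
        if key in seen:
            hits += BLOCK
        else:
            misses += BLOCK
            seen.add(key)
        parent = key
    misses += len(tokens) - full  # a trailing partial block is not cached
    return hits, misses


if __name__ == "__main__":
    system_prompt = list(range(8_192))  # hypothetical fixed 8,192-token prompt
    seen: Set[int] = set()
    total_hits = total_misses = 0
    for call in range(1_000):
        user_turn = [100_000 + call] * 500  # 500 fresh tokens per call
        h, m = bill_prompt(system_prompt + user_turn, seen)
        total_hits += h
        total_misses += m
    hit_price, miss_price = 0.003625, 0.435  # USD per million input tokens
    cost = (total_hits * hit_price + total_misses * miss_price) / 1_000_000
    print(total_hits, total_misses, f"${cost:.4f}")  # 8183808 508192 $0.2507
```

Because each block's key includes its parent's key, the cache can only match a shared prefix; a change early in the prompt turns everything after it into misses, which is why stable system prompts benefit most.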
