Summary
- StepAudio 2.5 Realtime is a real-time speech model that offers fully customizable personas in both Chinese and English.
- In StepFun's own tests, conducted in April 2026, the model outperformed competitors including GPT Realtime 1.5 and Gemini Live across all five voice AI benchmarks.
- The model was developed using a dataset comprising a million personas and fine-tuned with roleplay-focused reinforcement learning to address the common issue of maintaining character under stress.
This week, StepFun, an AI lab based in Shanghai, introduced StepAudio 2.5 Realtime. The model is end-to-end: it converts audio input directly to audio output without an intermediate text step. It works in both Chinese and English, and its reported benchmark results are strong.
StepFun is known for text-based large language models (LLMs) that outperform much larger alternatives. Earlier this year, its Step 3.5 Flash model, with 196 billion parameters, came out ahead of trillion-parameter competitors on four reasoning benchmarks. (Parameters are the learned values inside a model; more of them generally means more capacity to store knowledge.)
With this release, the company is focused on roleplay, particularly on keeping a persona consistent over extended interactions.
Addressing Character Drift
AI persona systems often struggle with out-of-character (OOC) behavior, where the AI deviates from its established persona under challenging circumstances. The limitation is well known across AI models, which tend to lose track of their character during prolonged interactions.
StepFun says it has addressed this issue with roleplay-specific reinforcement learning from human feedback (RLHF) that targets persona stability rather than overall quality alone. Training began with more than 10,000 human-written persona seeds, which were then algorithmically expanded into a dataset of one million personas.
The goal is to ensure that the training data is diverse enough to handle even unusual or complex conversations without disrupting the model's character.
StepAudio also offers paralinguistic understanding: it interprets non-verbal acoustic cues, such as speaking tempo, emotional tone, and the speaker's apparent age, before generating a response.
In the paralinguistic comprehension benchmark—a test that evaluates the model's ability to perceive emotional nuances and speaking rates on a scale of 0 to 100—StepAudio achieved a score of 82.18. In comparison, GPT Realtime 1.5 scored 80.46, Gemini Live received 58.05, and DouBao Realtime had a score of 16.09.
For the human evaluation benchmark, in which real users interacted with the model through a mobile app and human judges rated the results on a scale of 0 to 100, StepAudio scored 80.41, while GPT Realtime 1.5 and Gemini Live scored 68.01 and 67.16, respectively. In the general dialogue quality assessment, measured objectively via API on the same 0 to 100 scale, StepAudio scored 86.36, compared to 81.60 for GPT Realtime 1.5.
These benchmarks come from StepFun itself, so some skepticism is warranted, but the results in paralinguistics and interactive dialogue are significant enough to merit attention.
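To put the reported numbers side by side, the short Python sketch below computes how far the top scorer leads the runner-up in each of the three benchmarks quoted above. The scores are the figures StepFun reported; the dictionary layout and the function name `lead_over_runner_up` are illustrative choices, not part of any StepFun tooling.

```python
# Scores as reported by StepFun (0 to 100 scale); not independently verified.
REPORTED_SCORES = {
    "paralinguistic comprehension": {
        "StepAudio 2.5 Realtime": 82.18,
        "GPT Realtime 1.5": 80.46,
        "Gemini Live": 58.05,
        "DouBao Realtime": 16.09,
    },
    "human evaluation": {
        "StepAudio 2.5 Realtime": 80.41,
        "GPT Realtime 1.5": 68.01,
        "Gemini Live": 67.16,
    },
    "general dialogue quality": {
        "StepAudio 2.5 Realtime": 86.36,
        "GPT Realtime 1.5": 81.60,
    },
}


def lead_over_runner_up(scores):
    """Return (leader, runner_up, margin in points) for one benchmark."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (leader, top), (runner_up, second) = ranked[0], ranked[1]
    return leader, runner_up, round(top - second, 2)


if __name__ == "__main__":
    for benchmark, scores in REPORTED_SCORES.items():
        leader, runner_up, margin = lead_over_runner_up(scores)
        print(f"{benchmark}: {leader} leads {runner_up} by {margin:.2f} points")
```

By this arithmetic, the lead over GPT Realtime 1.5 is narrow in paralinguistic comprehension (1.72 points), wider in general dialogue quality (4.76 points), and widest in the human evaluation (12.40 points).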
Background on StepFun
StepFun was founded in April 2023 by Jiang Daxin, who spent 16 years at Microsoft overseeing projects like Bing, Cortana, and Azure Cognitive Services. The company is counted among China's "AI Tiger" startups and has raised approximately $1.7 billion to date.
OpenAI's Advanced Voice Mode, which launched in late 2024, set the standard that competitors are trying to meet. StepFun is now benchmarking itself directly against OpenAI and claims superior results.
The launch features a flagship AI persona named Xiao Yue, which StepFun describes as a "soul-level companion" designed to replicate the experience of chatting with a friend rather than interacting with software. Users can customize aspects like opinions, catchphrases, and emotional boundaries.
Developers can create their own personas through the API, with full documentation available at platform.stepfun.com. The model is live now.
