Alibaba Qwen-Robot Suite: AI Models for Robotics

Alibaba has launched the Qwen-Robot Suite, a set of AI models for robotics, including navigation, object manipulation, and scene prediction.

The suite comprises three models built for robots working in physical environments: Qwen-RobotNav for navigation, Qwen-RobotManip for object manipulation, and Qwen-RobotWorld for scene prediction. The team describes the project as "a full stack for embodied intelligence."

📣 Introducing the Qwen-Robot Suite — Qwen-RobotNav, Qwen-RobotManip, Qwen-RobotWorld, three foundation models, a full stack for embodied intelligence.
🧭 Qwen-RobotNav — the gateway to mobility.
• Unifies 5 navigation tasks in one model: instruction following, point-goal,… pic.twitter.com/noumjTtTeS
— Qwen (@Alibaba_Qwen) June 16, 2026

The models are intended to help physical agents perceive their environment, plan actions, and carry out commands given in natural language. The Qwen-Robot Suite is currently in pilot testing with select Alibaba Cloud enterprise customers in the robotics sector.

Why Alibaba is Bringing Qwen to the Physical World

Large language and multimodal models can handle text, images, video, and speech, but that alone is not enough for robots. Physical agents must not only understand commands but also translate them into movement, accounting for space, object properties, sensor limitations, and the consequences of their actions.

Alibaba refers to this area as physical AI, or "embodied AI." In this approach, the model must work not only with digital data but also with the physical environment: moving, locating objects, controlling manipulators, and predicting what will happen after an action.

Qwen-RobotNav: Five Navigation Tasks in One Model

Qwen-RobotNav is responsible for navigation. The model combines five task groups:

following instructions;
moving to a designated point;
locating objects;
tracking a target;
autonomous driving.

According to Alibaba, Qwen-RobotNav is built on Qwen3-VL and trained on 15.6 million samples related to route planning and vision-language reasoning.

The company reported a 76.5% success rate on VLN-CE RxR and 90% on EVT-Bench. Alibaba also noted that the model can function as a tool for larger agent systems: a higher-level model plans the task while Qwen-RobotNav handles the movement.
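
The division of labor described here, with a higher-level model planning and a navigation model executing movement, can be pictured as a simple tool-dispatch loop. The sketch below is purely illustrative: `NavRequest`, `NavResult`, `StubNavigator`, and `run_plan` are hypothetical names, the task labels are invented, and nothing here reflects the actual Qwen-RobotNav interface, which Alibaba's announcement does not describe.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class NavRequest:
    """Hypothetical request a planner might hand to a navigation tool."""
    task: str         # invented label, e.g. "instruction_following"
    instruction: str  # natural-language description of the movement


@dataclass(frozen=True)
class NavResult:
    success: bool
    evidence: str     # e.g. a short note on what the robot observed


class Navigator(Protocol):
    def navigate(self, request: NavRequest) -> NavResult: ...


class StubNavigator:
    """Stand-in for a navigation model; it only echoes the request."""

    def navigate(self, request: NavRequest) -> NavResult:
        return NavResult(success=True, evidence=f"reached: {request.instruction}")


def run_plan(steps: list[NavRequest], navigator: Navigator) -> list[NavResult]:
    """Execute planner-produced steps in order, stopping at the first failure."""
    results: list[NavResult] = []
    for step in steps:
        result = navigator.navigate(step)
        results.append(result)
        if not result.success:
            break
    return results


# Steps that a higher-level planner is assumed to have produced.
plan = [
    NavRequest("instruction_following", "go to the kitchen"),
    NavRequest("object_search", "find the red mug"),
]
results = run_plan(plan, StubNavigator())
```

Keeping the navigator behind a small interface means the planner never needs to know how movement is carried out, which is the separation the announcement describes.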

In demonstrations, Alibaba describes scenarios such as searching for a lost item indoors or checking whether a specific object in a building is open. In these tasks, the robot must not only move but also gather visual evidence and return an answer to the user.

Qwen-RobotManip: Object Manipulation

Qwen-RobotManip is designed for physical actions with objects. The model aims to assist robots in picking up, moving, and placing items, as well as transferring skills between different types of devices.

A key challenge in robotics is that different robots represent actions differently. A single manipulator arm, a dual-arm platform, a robot with a dexterous hand, and a mobile system all use different coordinates, joints, and command formats. Qwen-RobotManip seeks to standardize this data so that training on one type of robot benefits another.
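
To make the standardization idea concrete, here is a minimal sketch of one simple approach: scale each joint command to a common range and pad it into a fixed-width vector with a mask. This is an assumption for illustration only. The article does not say how Qwen-RobotManip unifies actions, and `MAX_DIM`, `UnifiedAction`, and `to_unified` are hypothetical names.

```python
from dataclasses import dataclass

MAX_DIM = 8  # assumed width of the shared action vector (hypothetical)


@dataclass(frozen=True)
class UnifiedAction:
    values: tuple[float, ...]  # joint commands scaled to [-1, 1], padded to MAX_DIM
    mask: tuple[bool, ...]     # True where a slot holds a real command


def to_unified(command: list[float], limits: list[tuple[float, float]]) -> UnifiedAction:
    """Scale each joint command to [-1, 1] using its limits, then pad to MAX_DIM."""
    if len(command) != len(limits):
        raise ValueError("one (low, high) limit pair is required per joint")
    if len(command) > MAX_DIM:
        raise ValueError("robot has more joints than the shared vector allows")
    scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x, (lo, hi) in zip(command, limits)]
    pad = MAX_DIM - len(scaled)
    return UnifiedAction(
        values=tuple(scaled) + (0.0,) * pad,
        mask=(True,) * len(scaled) + (False,) * pad,
    )


# Two robots with different joint counts and units end up in the same layout.
arm_a = to_unified([0.0, 90.0], [(-90.0, 90.0), (0.0, 180.0)])         # degrees
arm_b = to_unified([1.0, 0.5, 0.0], [(0.0, 1.0), (0.0, 1.0), (-1.0, 1.0)])
```

The mask lets a downstream model ignore padded slots, so a two-joint robot and a three-joint robot can share one input format.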

For training, Alibaba used over 38,100 hours of data, including 11,320 hours of open robotics data, 1,933 hours of first-person human action videos, and 24,808 hours of synthetic robotic demonstrations created from such videos.
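
A quick arithmetic check on the figures above shows how the training mix breaks down. The three listed components sum to 38,061 hours, slightly below the "over 38,100" headline figure; the article does not say whether the gap reflects rounding or additional unlisted data. The variable names below are illustrative.

```python
# Hours per data source, as reported in the article.
components = {
    "open robotics data": 11_320,
    "first-person human video": 1_933,
    "synthetic demonstrations": 24_808,
}

total = sum(components.values())  # 38,061 hours across the listed sources

# Percentage share of each source within the listed total.
shares = {name: round(100 * hours / total, 1) for name, hours in components.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

By this count, synthetic demonstrations make up roughly two thirds of the listed data, and first-person human video about one twentieth.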

The company claimed that the model ranked first in RoboChallenge Table30 v1 in the universal models track. According to Alibaba, Qwen-RobotManip also handled new instructions and unfamiliar objects robustly and transferred skills between different robots.

Qwen-RobotWorld: A World Model for Robots

Qwen-RobotWorld is a video world model driven by natural language. It is designed to predict how a scene will evolve after a given action.

For instance, the model receives current observations and a text command, then generates a probable future state of the environment. This approach can be used for manipulation, autonomous driving, navigation, planning, and creating synthetic training data for robots.
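
The planning use case can be sketched as "imagine, score, pick": for each candidate command, ask the world model for the predicted outcome and keep the command whose outcome scores best. The example below is a toy under stated assumptions. The scene is a single number rather than video, and `WorldModel`, `ToyWorldModel`, and `choose_action` are hypothetical names unrelated to the real Qwen-RobotWorld interface.

```python
from typing import Callable, Protocol


class WorldModel(Protocol):
    def predict(self, observation: float, command: str) -> float: ...


class ToyWorldModel:
    """Toy stand-in: the 'scene' is one number, a gripper position on a rail."""

    EFFECTS = {"move left": -1.0, "move right": 1.0, "stay": 0.0}

    def predict(self, observation: float, command: str) -> float:
        return observation + self.EFFECTS[command]


def choose_action(
    model: WorldModel,
    observation: float,
    candidates: list[str],
    score: Callable[[float], float],
) -> str:
    """Imagine each candidate's outcome and return the best-scoring command."""
    return max(candidates, key=lambda cmd: score(model.predict(observation, cmd)))


goal = 3.0
best = choose_action(
    ToyWorldModel(),
    observation=1.0,
    candidates=["move left", "move right", "stay"],
    score=lambda state: -abs(goal - state),  # closer to the goal is better
)
print(best)  # move right
```

A real video world model would return predicted frames instead of a number, and the scoring function would have to judge those frames, but the control flow stays the same.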

To train Qwen-RobotWorld, the team compiled the Embodied World Knowledge corpus, which includes 8.6 million video-text pairs and over 200 million frames, covering more than 20 types of robotic platforms and over 500 categories of actions.

Alibaba stated that Qwen-RobotWorld ranked first in EWMBench and DreamGen Bench, and outperformed all open models in WorldModelBench and PBench. The technical description also claims that the model's predictions are highly consistent with basic physical laws, including motion, conservation of mass, fluid behavior, and gravity.

Still a Long Way to Go for Mass Robotics

Despite the reported results, the Qwen-Robot Suite remains a set of models rather than a ready-to-use consumer robotics platform. Real-world deployment faces challenges such as sensor noise, actuator wear, unexpected situations, perception errors, and a vast number of rare scenarios. Many of the benchmarks used to compare such systems are run in simulation or under limited experimental conditions.

Alibaba has not disclosed the cost of access, the timeline for public launch, or the list of clients currently testing the Qwen-Robot Suite.

As a reminder, in April, Alibaba Cloud introduced the agent model Qwen3.6-Plus with a context window of 1 million tokens and support for external tools.
