The startup Physical Intelligence, founded by former Google engineers, has introduced the π0.7 model. The developers claim it represents a "qualitative leap" in AI's ability to generalize skills and perform tasks it was not directly trained for.

Our newest model, π0.7, has some interesting emergent capabilities: it can control a new robot to fold shirts for which we had no shirt folding data, figure out how to use an appliance with language-based coaching, and perform a wide range of dexterous tasks all in one model! pic.twitter.com/s9NxKfb7pe

— Physical Intelligence (@physical_int) April 16, 2026

This system belongs to the "Vision-Language-Action" (VLA) class and is designed for robot control.

Unlike previous solutions, π0.7 has shown signs of compositional generalization—the ability to combine previously learned skills to tackle new tasks.

Untrained Tasks and Transfer Between Robots

In experiments, the model exhibited several unexpected capabilities. Notably, π0.7 was able to control a new type of robot (a UR5e) and fold t-shirts, even though the team had no laundry-folding data for that specific platform.

Compositional generalization is a key capability of large models like LLMs, but it has been elusive in robotics. Another emergent ability we found is to control a new robot (UR5e) to fold t-shirts, even though we didn't have any laundry folding data on this robot. pic.twitter.com/lAXYag002Z

— Physical Intelligence (@physical_int) April 16, 2026

The results are comparable to those of operators with hundreds of hours of teleoperation experience, the developers noted.

The model also worked out how to use previously unfamiliar devices, including kitchen appliances. For instance, the robot completed part of a task to prepare sweet potatoes in an air fryer, even though such scenarios were not included in the training dataset.

According to the developers, this was made possible by integrating disparate skills—similar to how language models combine knowledge from different domains.

Control Through Language and Context

One of the key features of π0.7 is that it can be steered not only by commands specifying "what to do," but also by guidance on "how to do it."

The model accepts:

  • text instructions;
  • metadata (e.g., speed and quality of execution);
  • visual subgoals—images of the expected outcome of a step.

Some of the subgoals can be generated during operation by an auxiliary system, which the developers describe as a lightweight world model. This allows the robot to adjust its behavior without additional training.
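To make the three input types more concrete, here is a minimal Python sketch of how such a prompt could be represented. It is purely illustrative: Physical Intelligence has not published this interface, and every name in it (`StepPrompt`, `ExecutionMetadata`, `attach_subgoal`, the `speed` and `quality` values) is a hypothetical stand-in, not part of π0.7.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ExecutionMetadata:
    # Hypothetical fields: the article only mentions "speed and quality of execution".
    speed: str = "normal"
    quality: str = "high"


@dataclass
class StepPrompt:
    # Free-form text covering both "what to do" and "how to do it".
    instruction: str
    metadata: ExecutionMetadata = field(default_factory=ExecutionMetadata)
    # Encoded image of the expected outcome of the step, if one is available.
    subgoal_image: Optional[bytes] = None

    def modalities(self) -> List[str]:
        """List which of the three input types this prompt carries."""
        present = ["text", "metadata"]
        if self.subgoal_image is not None:
            present.append("visual_subgoal")
        return present


def attach_subgoal(prompt: StepPrompt, generate: Callable[[str], bytes]) -> StepPrompt:
    """Return a prompt with a subgoal image, generating one only if it is missing.

    `generate` stands in for the auxiliary system; in this sketch it is any
    callable that maps an instruction string to image bytes.
    """
    if prompt.subgoal_image is None:
        return StepPrompt(prompt.instruction, prompt.metadata, generate(prompt.instruction))
    return prompt
```

In this sketch the subgoal image is optional, so a prompt made of text and metadata alone remains valid, and an image can be filled in later by whatever auxiliary generator is available at run time.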

π0.7 handles diverse prompts that don't just say what to do, but also how to do it, including rich language and multimodal information, such as visual subgoal images. At test time, these images can be produced by a lightweight world model. pic.twitter.com/cbdovdVjBG

— Physical Intelligence (@physical_int) April 16, 2026

This approach enables the integration of data from various sources—video, telemetry from robots, and autonomously collected episodes—into a unified training system.

A First Step Towards 'Universal' Robots

Physical Intelligence noted that earlier models required retraining for each task—similar to early versions of language models. In contrast, π0.7 works "out of the box" and adapts to new scenarios through language.

The team emphasized that this level of generalization has long been considered a strong point of LLMs, but remained elusive in robotics.

Despite the progress, the model does not always handle complex tasks without step-by-step prompts. When given sequential instructions, however, its execution quality improves significantly.

In the future, such instructions could help train more autonomous machines capable of acting without human intervention. Physical Intelligence believes that π0.7 shows the first signs of a transition to universal robots that adapt to new conditions without manual tuning for each task.

As a reminder, in February Carbon Robotics released the Large Plant Model, an AI model that recognizes plant species to help combat weeds.