π0: A Vision-Language-Action Flow Model for General Robot Control
TL;DR
- Key ideas:
- Integrate Vision-Language Model (VLM) visual-semantic knowledge into robot control.
- Use flow matching to represent continuous, high-frequency action chunks.
- This achieves fine-grained and coherent action generation.
- Utilize cross-embodiment data for generalization across different robots.
- Splitting training into pre-training and post-training in robotics is similar to the LLM pretrain/finetune/align approach:
- Pre-training learns broad knowledge.
- Post-training learns specific and stable strategy.
- Experience in other fields shows that pre-training a general vision-language model on diverse datasets, then fine-tuning it for specific tasks, outperforms training solely on specific tasks.
- This approach can, to some extent, alleviate data scarcity issues because generalist models have many more data sources, including non-robot sources.
- For robustness, diverse datasets may contain corrections and recovery behaviors.
- Therefore, generalist models may solve problems related to data availability, generalization, and robustness.
There are three main challenges:
- Large-scale datasets and pre-training.
- The right model architecture: one that can absorb diverse, cross-embodiment data yet still express the fine-grained behaviors needed for specific tasks.
- The right training recipe.
Similar to language model training, our method is divided into two phases: pre-training and post-training.
- Pre-training allows the model to learn various tasks at rudimentary proficiency and follow language commands.
- For complex and dexterous tasks, we then employ a post-training procedure.
Model
- We use PaliGemma as the Vision-Language Model (VLM).
- In addition to images and text as conditions, we also include proprioceptive state.
- We add an extra set of parameters for flow matching to denoise the action sequence.
- This part primarily references “Transfusion: Predict the next token and diffuse images with one multi-modal model.”
- However, unlike Transfusion, we separate the VLM and the flow model here.
- This is a Flow model, not a diffusion model. It predicts a vector field, not a score function.
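The split described above can be sketched as two parameter sets over one joint attention: a minimal, illustrative stand-in (not the actual π0 implementation; the dimensions, token counts, and single-layer setup are assumptions) where the noisy-action tokens have their own projection weights but attend over both streams' keys and values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding width (hypothetical)

def attend(q, k, v):
    """Plain scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v

vlm_tokens = rng.standard_normal((6, D))     # image/text tokens (VLM stream)
action_tokens = rng.standard_normal((4, D))  # noisy-action tokens (flow-expert stream)

# Two separate parameter sets (Q/K/V projections), one per stream:
W_vlm = rng.standard_normal((3, D, D))
W_act = rng.standard_normal((3, D, D))
q_v, k_v, v_v = (vlm_tokens @ W for W in W_vlm)
q_a, k_a, v_a = (action_tokens @ W for W in W_act)

# Action queries attend over the concatenation of both streams' keys/values,
# so the flow expert can read the VLM's representations without sharing weights.
out = attend(q_a, np.concatenate([k_v, k_a]), np.concatenate([v_v, v_a]))
print(out.shape)  # (4, 8)
```

The point of the sketch is only the weight separation: the VLM keeps its pre-trained parameters while the smaller action expert learns its own.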
The model predicts p(A_t | o_t), where A_t = [a_t, a_{t+1}, …, a_{t+H-1}] is the action chunk to be predicted and H is the chunk length, with H = 50.
- During inference, we integrate the learned flow with 10 steps.
- The action model size is 300M, while PaliGemma is 3B.
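Inference under this setup can be sketched as a forward-Euler integration of the learned vector field, starting from Gaussian noise and taking the 10 steps noted above. Everything except H = 50 and the 10 steps is an assumption here: the action dimension is made up, and `vector_field` is a toy stand-in for the trained network.

```python
import numpy as np

H, ACTION_DIM = 50, 24   # H = 50 from the notes; ACTION_DIM is a hypothetical value
NUM_STEPS = 10           # inference integration steps, per the notes

def vector_field(actions, tau, obs):
    """Stand-in for the learned flow network v_theta(A^tau, tau | o_t).
    Here: a toy field that pulls samples toward zero, for illustration only."""
    return -actions

def sample_action_chunk(obs, rng):
    # Start from Gaussian noise and integrate dA/dtau = v(A, tau | o)
    # with forward Euler over NUM_STEPS uniform steps (tau: 0 -> 1).
    actions = rng.standard_normal((H, ACTION_DIM))
    dt = 1.0 / NUM_STEPS
    for step in range(NUM_STEPS):
        tau = step * dt
        actions = actions + dt * vector_field(actions, tau, obs)
    return actions

chunk = sample_action_chunk(obs=None, rng=np.random.default_rng(0))
print(chunk.shape)  # (50, 24)
```

Because the flow predicts a vector field rather than a score, sampling is a deterministic ODE integration once the initial noise is drawn, which is why a small fixed step count works.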

Data and Training
- Pre-training data should be as diverse as possible, even if some data quality is not high. Intuitively, diverse (but lower quality) pre-training data allows the model to recover from mistakes and handle highly varied situations.
- Approximately 10,000 hours of data are used.
- Post-training data should be of the highest possible quality.
Pre-training Data
- 9.1% from open-source datasets:
- OXE: Open X-Embodiment: Robotic learning datasets and RT-X models
- BridgeData v2: A dataset for robot learning at scale
- DROID: A large-scale in-the-wild robot manipulation dataset
- Robots and tasks in these datasets typically have one or two cameras and use low-frequency control (between 2 and 10 Hz).
- However, these datasets cover a wide range of objects and environments.
Our Own Datasets: 903 million timesteps, 68 tasks.
- 106 million steps are from single-arm robots.
- 797 million steps are from dual-arm robots.
- Tasks are more complex than simple noun-verb combinations, for example, “bussing a table,” which requires placing various items in designated locations.
To prevent some tasks from being overrepresented, the weight for each task-robot combination is n^0.43, where n is the number of samples for that category.
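The n^0.43 re-weighting can be computed directly; a small sketch with hypothetical category counts, showing how large categories are down-weighted relative to uniform per-sample weighting:

```python
# Hypothetical sample counts per (task, robot) combination:
counts = {"bussing-ur5e": 100_000, "folding-arx": 1_000, "drawer-franka": 10}

# Weight each category by n**0.43, then normalize to sampling probabilities.
raw = {k: n ** 0.43 for k, n in counts.items()}
total = sum(raw.values())
weights = {k: w / total for k, w in raw.items()}

for k, p in weights.items():
    print(k, round(p, 4))
```

With an exponent below 1, a category with 100x more samples gets far less than 100x the sampling probability, flattening the task distribution without discarding data.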
Seven Robot Types Used:
- UR5e: An arm with a parallel jaw gripper, with a wrist-mounted and over-the-shoulder camera.
- Bimanual UR5e.
- Franka.
- Bimanual Trossen.
- Bimanual ARX & Bimanual AgileX.
- Mobile Trossen & Mobile ARX.
- Mobile Fibocom.
How well does π0, after pre-training alone, perform on the variety of tasks present in the pre-training data?
Comparison with Other Models:
- π0-small: Does not use VLM initialization, uses DistilBERT as text encoder, and a smaller pre-trained ViT as image encoder. It has 470M parameters.
- OpenVLA, 7B: Uses all our datasets but with fewer epochs.
- Octo, 93M: A generalist model, not VLA but a diffusion model. Uses all our datasets but with fewer epochs.
- π0 parity: Uses the same number of epochs as OpenVLA and Octo for fair comparison.
Results:
- π0 significantly outperforms baseline models on more complex tasks.
- π0 parity still outperforms other baselines, demonstrating the effectiveness of its model architecture.
- π0-small outperforms OpenVLA and Octo but is not as good as π0 parity, highlighting the importance of VLM initialization.

Language Following Ability
- We compare only π0 and π0-small.
- π0 significantly outperforms π0-small in language following accuracy. VLM initialization substantially improves the model’s ability to understand and execute intermediate language steps.
Three Variants of Instructions:
- Flat: Direct command, e.g., “bag the groceries.”
- Human: Step-by-step intermediate language instructions provided by human experts.
- High-level VLM policy: Intermediate language instructions generated by a VLM.
Findings:
- π0-small cannot learn useful intermediate steps from human expert instructions, whereas π0 can.
- The high-level VLM policy performs slightly worse than human expert instructions.
Learning New Dexterous Tasks
Three Categories of Tasks:
- Easy: Similar to tasks in the pre-training data, e.g., UR5e stack bowls, towel folding.
- Medium: “Tupperware in microwave” – involves an unseen object (microwave). Introduces new objects.
- Hard: Significantly different from pre-training, e.g., paper towel replacement, Franka items in drawer.
Overall:
- Fine-tuning based on π0 generally outperforms other methods, especially showing stronger data efficiency when fine-tuning data is limited (e.g., 1 hour).

Compare with or without Pre-training
Can training only on fine-tuning data achieve the same performance as fine-tuning π0?
Those tasks are complex, long-duration, and require multiple decisions and long-term planning.
Let’s use “π0” to refer to the model fine-tuned from the pre-trained π0.
Let’s use “π0-only” to refer to the model trained only on the fine-tuning data, with no pre-training.
Findings:
- For tasks included in the pre-training data, π0 significantly outperforms π0-only.
- For tasks not included in the pre-training data, π0 still leads, but the gap is smaller.
References
- Open X-Embodiment: Robotic learning datasets and RT-X models.
- RT-2: Vision-language-action models transfer web knowledge to robotic control.
- TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation.
- Diffusion Policy: Visuomotor policy learning via action diffusion.
- Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation.
- Playground v3: Improving text-to-image alignment with deep-fusion large language models.
- Transfusion: Predict the next token and diffuse images with one multi-modal model.