[Paper Note] π0: A Vision-Language-Action Flow Model for General Robot Control


TL;DR

  • Key ideas:
    • Integrate Vision-Language Model (VLM) visual-semantic knowledge into robot control.
    • Use flow matching to represent continuous, high-frequency action chunks.
    • This achieves fine-grained and coherent action generation.
    • Utilize cross-embodiment data for generalization across different robots.
  • Splitting training into pre-training and post-training in robotics mirrors the LLM pretrain/finetune/align approach:
    • Pre-training learns broad knowledge.
    • Post-training learns specific and stable strategy.
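To make the flow-matching idea concrete, here is a minimal sketch of how one training pair could be constructed for an action chunk. This is an illustration of standard flow matching with a linear (straight-line) interpolation path, not the paper's exact recipe; the shapes, the uniform sampling of the flow time, and the function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: an H-step action chunk with D-dim actions
# (the paper uses H=50; D=7 is illustrative).
H, D = 50, 7

def flow_matching_targets(actions, rng):
    """Build one flow-matching training pair for an action chunk.

    Interpolates linearly between Gaussian noise (tau=0) and the
    data (tau=1); the regression target for the network is the
    constant velocity along that path, (actions - noise).
    """
    noise = rng.standard_normal(actions.shape)
    tau = rng.uniform()                       # flow "time" in [0, 1]
    noisy = tau * actions + (1.0 - tau) * noise
    velocity = actions - noise                # regression target
    return noisy, tau, velocity

actions = rng.standard_normal((H, D))         # a ground-truth action chunk A_t
noisy, tau, velocity = flow_matching_targets(actions, rng)

# Sanity check: following the remaining flow from tau recovers the data.
assert np.allclose(noisy + (1.0 - tau) * velocity, actions)
```

The network would then be trained to regress `velocity` given `noisy`, `tau`, and the observation; because the path is continuous, the model can represent smooth, high-frequency action chunks rather than discretized tokens.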

There are three main challenges:

Similar to language model training, our method is divided into two phases: pre-training and post-training.

Model

The model predicts $p(A_t \mid o_t)$, where $A_t = [a_t, a_{t+1}, \ldots, a_{t+H-1}]$ is the action chunk to be predicted and $H$ is the length of the action chunk, with $H = 50$.
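At inference time, a chunk can be sampled from $p(A_t \mid o_t)$ by integrating the learned velocity field from noise to data. A minimal sketch with forward Euler steps, using a toy velocity field in place of the trained network (the step count, shapes, and function names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 50, 7          # chunk length from the paper; action dim is illustrative

def sample_chunk(velocity_fn, steps=10):
    """Integrate the flow from Gaussian noise (tau=0) to an action
    chunk (tau=1) with forward Euler steps. velocity_fn stands in
    for the trained network v_theta(a, tau)."""
    a = rng.standard_normal((H, D))           # start from pure noise
    dt = 1.0 / steps
    tau = 0.0
    for _ in range(steps):
        a = a + dt * velocity_fn(a, tau)
        tau += dt
    return a

# Toy velocity field whose exact flow drives any state straight to
# `target` at tau=1, so the result is verifiable.
target = rng.standard_normal((H, D))
v = lambda a, tau: (target - a) / (1.0 - tau)
chunk = sample_chunk(v, steps=10)
assert np.allclose(chunk, target)
```

A real deployment would call the network once per Euler step, so a small number of steps keeps chunk generation fast enough for high-frequency control.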

(Figure 3 of the paper.)

Data and Training

Pre-training Data

Our Own Datasets: 903 million timesteps, 68 tasks.

To prevent some tasks from being over-represented, each task-robot combination is weighted by $n^{0.43}$, where $n$ is the number of samples for that combination.
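A small sketch of this sub-linear re-weighting, with made-up task names and sample counts (only the exponent 0.43 comes from the paper):

```python
# Hypothetical per-(task, robot) sample counts; values are illustrative.
counts = {
    "robotA/fold_shirt": 120_000,
    "robotB/bus_table": 4_000,
    "robotA/bag_groceries": 30_000,
}

def sampling_weights(counts, alpha=0.43):
    """Weight each task-robot combination by n**alpha, then normalize.

    With alpha < 1 the weights grow sub-linearly in n, so a combination
    with 30x more data gets only about 30**0.43 ~= 4.3x more sampling mass.
    """
    raw = {k: n ** alpha for k, n in counts.items()}
    total = sum(raw.values())
    return {k: w / total for k, w in raw.items()}

probs = sampling_weights(counts)
assert abs(sum(probs.values()) - 1.0) < 1e-12
```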

Seven Robot Types Used:

Base Model Performance

How well does $\pi_0$ perform, after pre-training, on the variety of tasks present in the pre-training data?

Comparison with Other Models:

Results:

(Figure 7 of the paper.)

Language Following Ability

Three Variants of Instructions:

Findings:

Learning New Dexterous Tasks

Three Categories of Tasks:

Overall:

(Figure 11 of the paper.)

Comparison With and Without Pre-training

Can training only on the fine-tuning data achieve the same performance as fine-tuning $\pi_0$?

These tasks are complex and long-horizon, requiring multiple decisions and long-term planning.

Let’s use “$\pi_0$” to refer to the model fine-tuned from the pre-trained $\pi_0$.

Let’s use “$\pi_0$-only” to refer to the model trained only on the fine-tuning data, without pre-training.

Findings:
