XF-Blog
MACHINE LEARNING PAPER NOTE
[Paper Note] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations


We hypothesize that video diffusion models can capture dynamic information and better predict the future physical world. This capability could provide valuable guidance for robot action learning.

Video Prediction Policy

For generalist policies, the visual module is crucial.

Our approach is divided into two steps: (1) fine-tuning a text-guided video prediction model for robot manipulation, and (2) learning actions conditioned on its predictive visual representations.

Related Work

Limitations of Existing Approaches:

Insights from Diffusion Models:

Text-guided Video Prediction Model for Robot Manipulation

Although video generation models are pretrained on large-scale, general-purpose datasets, they are not fully controllable and often produce suboptimal results in specialized domains such as robot manipulation.

Our base model is Stable Video Diffusion (SVD).

Action Learning Conditioned on Predictive Visual Representation

(Figure 2)

Our key insight is that even a single forward (denoising) step of the video diffusion model, while not producing a clean video, already sketches a rough trajectory of future states and provides valuable guidance for action learning.

We found that the up-sampling layers of the U-Net denoiser yield more effective representations than the earlier down-sampling layers.

We therefore interpolate the feature maps from these later layers to a common width and height and concatenate them along the channel dimension.
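The interpolate-then-concatenate step can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the channel counts, resolutions, and nearest-neighbor interpolation are assumptions for demonstration.

```python
import numpy as np

def upsample_nearest(x, out_h, out_w):
    # x: (C, H, W) -> (C, out_h, out_w) via nearest-neighbor indexing.
    c, h, w = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[:, rows][:, :, cols]

def concat_multiscale(feature_maps, out_h, out_w):
    # Bring each up-block output to a shared resolution,
    # then concatenate along the channel axis.
    resized = [upsample_nearest(f, out_h, out_w) for f in feature_maps]
    return np.concatenate(resized, axis=0)

# Hypothetical U-Net up-block outputs at increasing spatial resolution.
feats = [np.random.rand(320, 8, 8),
         np.random.rand(160, 16, 16),
         np.random.rand(80, 32, 32)]
fused = concat_multiscale(feats, 32, 32)
print(fused.shape)  # (560, 32, 32)
```

In a real pipeline one would typically use bilinear interpolation (e.g. `torch.nn.functional.interpolate`) on the GPU; nearest-neighbor is used here only to keep the sketch dependency-free.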

To further distill this information, we use a Transformer (a “Video Former”) that generates T × L tokens, conditioned on the concatenated feature maps. Feature maps from multiple camera views are kept as distinct inputs rather than being merged into a single feature map.
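One common way to realize such a token generator is a set of learned queries that cross-attend to the flattened feature-map tokens. The single-head sketch below is an assumption about the mechanism, not the paper's exact Video Former (head count, depth, and positional encodings are omitted); each camera view is attended to separately, mirroring the distinct-inputs design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    # Single-head cross-attention: queries attend to context tokens.
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

# Hypothetical sizes: T frames, L tokens per frame, model width D.
T, L, D = 4, 16, 64
rng = np.random.default_rng(0)
learned_queries = rng.normal(size=(T * L, D))  # one query per output token

# Each camera view contributes its own flattened feature-map tokens;
# views are processed as distinct inputs, never merged into one map.
views = [rng.normal(size=(256, D)), rng.normal(size=(256, D))]
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.05 for _ in range(3))

tokens = learned_queries
for ctx in views:  # attend to each view's features in turn
    tokens = cross_attention(tokens, ctx, Wq, Wk, Wv)
print(tokens.shape)  # (T*L, D) = (64, 64)
```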

Finally, these predictive tokens, together with the CLIP embeddings of the natural-language instruction, condition our diffusion policy model.

Experiments

We evaluated our approach using two standard benchmarks:

Because the datasets used in this work vary in scale and quality, we adopted the dataset sampling strategy of Octo, an open-source generalist robot policy.

Computational cost:

Since we use the video diffusion model solely as an encoder (a single forward step, with no iterative sampling), we achieve a control frequency of 7-10 Hz on an NVIDIA RTX 4090.
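As a back-of-the-envelope check, 7-10 Hz corresponds to roughly 100-143 ms of wall-clock time per control step (encoder pass plus policy denoising); the latency figures below are illustrative, not measurements.

```python
# Convert per-step latency (seconds) to control frequency (Hz).
def control_hz(latency_s):
    return 1.0 / latency_s

for latency in (0.10, 0.143):  # ~100 ms and ~143 ms per control step
    print(f"{control_hz(latency):.1f} Hz")  # prints 10.0 Hz, then 7.0 Hz
```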

Ablation Study

We conducted several ablation studies to understand the contribution of different components: