[Paper Note] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge


Standard VLA models typically map language and vision directly to robot actions.

Predicting the future can serve as an intermediate reasoning step that strengthens a VLA model, and common approaches predict future video frames or motion trajectories. However, these approaches have drawbacks: forecasting entire frames spends capacity on redundant pixel-level detail, while trajectory prediction alone misses the dynamic, spatial, and semantic knowledge that manipulation requires.
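To make the contrast concrete, here is a minimal sketch of the two designs. All module names, shapes, and the mean-pooling fusion are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class DirectVLA(nn.Module):
    """Baseline: fuse vision and language tokens, regress actions directly."""
    def __init__(self, dim=256, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, vis_tokens, lang_tokens):
        h = self.fuse(torch.cat([vis_tokens, lang_tokens], dim=1))
        return self.action_head(h.mean(dim=1))  # pooled features -> action

class PredictiveVLA(DirectVLA):
    """Variant: first predict a future latent, then decode the action from it."""
    def __init__(self, dim=256, action_dim=7):
        super().__init__(dim, action_dim)
        self.future_head = nn.Linear(dim, dim)  # predicts a future-state embedding

    def forward(self, vis_tokens, lang_tokens):
        h = self.fuse(torch.cat([vis_tokens, lang_tokens], dim=1))
        future = self.future_head(h.mean(dim=1))  # "dreamed" future state
        action = self.action_head(future)         # act on the prediction
        return action, future                     # future gets its own loss
```

During training, the returned future embedding would be supervised against features of the actual next observation (or a trajectory target), giving the policy an auxiliary prediction signal on top of the action loss.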

Method

Inputs

Queries

Model Architecture

Loss Functions

Attention Mask

The model decodes actions with a separate diffusion model rather than a simple MLP: action embeddings and world-knowledge embeddings live in the same latent space with similar statistics, so a single MLP pass has trouble disentangling them.
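As a toy illustration of this design choice (a sketch under assumed shapes and a crude update rule, not the paper's actual sampler or noise schedule): instead of asking a single MLP pass to pull the action out of the entangled latent, a diffusion head conditions every denoising step on that latent and iteratively refines a noise sample into an action:

```python
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    """Iteratively denoises an action sample conditioned on the latent,
    rather than regressing the action in a single MLP pass."""
    def __init__(self, action_dim=7, cond_dim=256, hidden=256, steps=10):
        super().__init__()
        self.steps = steps
        self.action_dim = action_dim
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, action_dim),  # predicts the noise to remove
        )

    @torch.no_grad()
    def sample(self, cond):
        """cond: (batch, cond_dim) latent from the VLA backbone."""
        a = torch.randn(cond.shape[0], self.action_dim, device=cond.device)
        for t in reversed(range(self.steps)):
            t_emb = torch.full((cond.shape[0], 1), t / self.steps,
                               device=cond.device)
            eps = self.net(torch.cat([a, cond, t_emb], dim=-1))
            a = a - eps / self.steps  # crude Euler-style step, illustration only
        return a
```

A real implementation would use a proper DDPM/DDIM noise schedule and train the network with the standard noise-prediction loss; the sketch only shows how conditioning each refinement step on the latent avoids the one-shot disentanglement a plain MLP would have to perform.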

Experiment

The model was tested on the CALVIN ABC-D benchmark in simulation.

Baselines

DreamVLA achieves the highest performance on the ABC-D tasks among the compared methods. Real-world gripper-grasping experiments were also conducted on a Franka Panda robot arm.

Ablation Study

Which visual knowledge is most important?

Other findings:
