[Paper Note] 3D-VLA: A 3D Vision-Language-Action Generative World Model


How to Integrate 3D Perception and Enable Robots to Understand 3D Scenes

This work integrates 3D perception into a vision-language-action model so that robots can reason about the 3D structure of their environment instead of operating on 2D images alone.

Main Contributions

Dataset

The training data is curated from a variety of existing robotics and embodied-AI datasets, which are augmented with 3D annotations as described below.

Generating Depth Information

Depth is estimated for datasets that lack it; from this depth, point clouds and 3D bounding boxes for the objects in each scene are derived.
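
The note does not spell out the lifting step, but standard pinhole back-projection makes it concrete. A minimal sketch; the function names, the NumPy formulation, and the assumption of known camera intrinsics are mine, not from the paper:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in meters, to per-pixel 3D points
    (H, W, 3) with the pinhole model. Intrinsics are assumed known or estimated."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def object_bbox_3d(points, mask):
    """Axis-aligned 3D bounding box for one object, given its 2D mask (H, W).
    Pixels with zero depth are treated as invalid and ignored."""
    obj = points[mask & (points[..., 2] > 0)]
    return obj.min(axis=0), obj.max(axis=0)  # min / max (x, y, z) corners
```

Axis-aligned boxes like these are a common proxy when oriented 3D boxes are unavailable.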

ChatGPT is used to diversify the instructions.
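
The paper's exact prompt is not reproduced here; as an illustration of this step, a hedged sketch using the OpenAI Python client, where the model choice, prompt wording, and `diversify` helper are my assumptions rather than the paper's pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def diversify(instruction: str, n: int = 3) -> list[str]:
    """Ask the chat model for n paraphrases of a robot instruction."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Rephrase the robot manipulation instruction. "
                        f"Return {n} distinct paraphrases, one per line."},
            {"role": "user", "content": instruction},
        ],
    )
    return resp.choices[0].message.content.strip().splitlines()

# e.g. diversify("pick up the red block and place it on the shelf")
```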

Method

Since 3D datasets are far less abundant than 2D datasets, the model builds on a BLIP-2 backbone. Rather than processing point clouds directly, it lifts features extracted from images taken at different viewpoints into 3D.
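
A minimal sketch of what such feature lifting can look like; the shapes, the shared-world-frame assumption, and the function name are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

def lift_multiview_features(feats, points, pos_mlp):
    """Turn per-view 2D features into location-aware 3D tokens.

    feats:   (V, H, W, C) features from a frozen 2D image encoder, one map per view.
    points:  (V, H, W, 3) per-pixel 3D positions (from depth + camera poses,
             assumed to be expressed in a shared world frame).
    pos_mlp: small MLP mapping xyz -> C, used as a 3D positional embedding.

    Returns (V*H*W, C) tokens that a 3D-LLM-style backbone can attend over,
    in place of features computed directly on a raw point cloud.
    """
    V, H, W, C = feats.shape
    f = feats.reshape(-1, C)
    xyz = points.reshape(-1, 3)
    return f + pos_mlp(xyz)  # inject 3D position into each 2D feature

# Example wiring (dimensions are placeholders):
# pos_mlp = nn.Sequential(nn.Linear(3, 256), nn.GELU(), nn.Linear(256, 1408))
```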

An additional diffusion model is used to generate the goal state in 3D, i.e., a prediction of what the scene should look like once the instruction has been carried out:

[Figure 2]
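
The coupling between the language model and the diffusion decoder is the interesting part. A rough sketch of one plausible wiring, where LLM hidden states at special goal-generation tokens condition a pretrained denoiser; every class, name, and dimension here is a placeholder, and the real noise-schedule math is deliberately omitted:

```python
import torch
import torch.nn as nn

class GoalDecoderSketch(nn.Module):
    """Illustrative wiring only. `denoiser` stands in for a pretrained
    RGB-D (or point cloud) diffusion model."""

    def __init__(self, denoiser, llm_dim=2048, cond_dim=768):
        super().__init__()
        # Projects LLM hidden states (taken at the special goal-generation
        # tokens, e.g. an <image>...</image> span) into the decoder's
        # conditioning space.
        self.proj = nn.Linear(llm_dim, cond_dim)
        self.denoiser = denoiser

    @torch.no_grad()
    def generate(self, llm_hidden, steps=50):
        cond = self.proj(llm_hidden)                # (B, T, cond_dim)
        x = torch.randn(cond.size(0), 4, 32, 32)    # latent goal, from noise
        for t in reversed(range(steps)):
            eps = self.denoiser(x, t, cond)         # predicted noise
            x = x - eps / steps                     # crude stand-in for a real sampler step
        return x
```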

Finetuning and Alignment of Diffusion Models
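
The heading suggests a two-stage recipe: pretrain the diffusion decoders on the curated data, then align them with the LLM by training a bridge between the two embedding spaces while the large pretrained parts stay frozen. A minimal training-step sketch under that reading; `add_noise` here is a toy stand-in for the forward-diffusion process, and all names are hypothetical, not the paper's code:

```python
import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, T=1000):
    """Toy forward diffusion: blend clean latent and noise by timestep.
    A real implementation would use the model's alpha-bar schedule."""
    a = (1 - t.float() / T).view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def alignment_step(llm, projector, denoiser, tokens, goal_latent, optimizer):
    """One alignment step: only `projector` receives gradients; the LLM and
    the pretrained diffusion decoder stay frozen."""
    with torch.no_grad():
        hidden = llm(tokens)                      # (B, T, llm_dim), frozen backbone
    cond = projector(hidden)                      # bridge into the decoder's condition space
    t = torch.randint(0, 1000, (goal_latent.size(0),))
    noise = torch.randn_like(goal_latent)
    pred = denoiser(add_noise(goal_latent, noise, t), t, cond)  # frozen decoder
    loss = F.mse_loss(pred, noise)                # epsilon-prediction objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```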

Experiments

The system’s spatial-reasoning performance is compared against other VLMs and 3D LLMs.

3D-VLA’s ability to generate RGB goal images and goal point clouds is also evaluated.
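
For the point-cloud side, Chamfer distance is the standard way to compare a generated cloud against the ground truth; whether the paper uses exactly this formulation is my assumption. A minimal sketch:

```python
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3):
    mean nearest-neighbor squared distance, measured in both directions."""
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.pow(2).mean() + d.min(dim=0).values.pow(2).mean()
```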

The system demonstrates generalization capabilities on two benchmarks: RLBench and CALVIN.

Reference

Zhen, H., et al. “3D-VLA: A 3D Vision-Language-Action Generative World Model.” 2024.