[Paper Note] OpenVLA: An Open-Source Vision-Language-Action Model

This paper proposes a new paradigm for robot control: instead of training new behaviors from scratch, it fine-tunes Vision-Language Models (VLMs) pre-trained on Internet-scale vision-language data on large robot demonstration datasets. The resulting model is called a Vision-Language-Action (VLA) model.

Traditional methods typically take language and vision embeddings from pre-trained models and train a separate action policy on top of them from scratch. In contrast, this paper fine-tunes the VLM itself to generate actions directly; such a model is called a VLA.

Benefits of VLA

However, existing VLA methods are either closed-source, making fine-tuning impossible, or trained and evaluated only on a single robot setup.

Method

Tokenization
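
OpenVLA discretizes each continuous action dimension into 256 uniform bins spanning the 1st to 99th percentile of the training data (rather than the min/max, so outliers do not stretch the bins), and maps the resulting bin indices onto the 256 least-used tokens of the Llama tokenizer vocabulary. A minimal NumPy sketch of the binning step (function and variable names are mine, not from the paper's code):

```python
import numpy as np

def make_bins(actions, n_bins=256, lo_pct=1, hi_pct=99):
    """Per-dimension uniform bin edges between the 1st and 99th
    percentile of the training actions; outliers get clipped into
    the edge bins. Returns edges of shape (n_bins + 1, n_dims)."""
    lo = np.percentile(actions, lo_pct, axis=0)
    hi = np.percentile(actions, hi_pct, axis=0)
    return np.linspace(lo, hi, n_bins + 1)

def tokenize(action, edges):
    """Map a continuous action vector to bin indices in [0, n_bins - 1]."""
    ids = np.empty(action.shape, dtype=np.int64)
    for d in range(action.shape[-1]):
        # digitize against the interior edges -> indices 0 .. n_bins - 1
        ids[d] = np.digitize(action[d], edges[1:-1, d])
    return ids

def detokenize(ids, edges):
    """Recover a continuous action as the centre of each chosen bin."""
    centers = (edges[:-1] + edges[1:]) / 2  # (n_bins, n_dims)
    return centers[ids, np.arange(ids.shape[-1])]
```

In the actual model, these integer bin indices are then written into the language model's vocabulary, so predicting an action is just next-token prediction.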

OpenX Dataset

Data Processing

Before the final experiments, small-scale training was conducted on the BridgeData V2 dataset.

Prismatic-7B, which uses SigLIP and DINOv2 as its visual encoders, exhibits better spatial reasoning capabilities than other VLMs.

The input consists of a single image.

No performance difference was observed between 224×224 and 384×384 input resolutions, but the lower resolution significantly reduces training time.

While frozen vision encoders often perform better in VLM training, fine-tuning the vision encoder is crucial in the VLA setting. The authors hypothesize that a frozen pre-trained vision backbone may not capture sufficiently fine-grained spatial details about the important parts of the scene to enable precise robotic control.
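
In practice, the frozen-versus-fine-tuned choice comes down to toggling gradient flow through the encoder. A minimal PyTorch sketch (the `vision_encoder` attribute name is illustrative, not the actual Prismatic module path):

```python
import torch.nn as nn

def set_vision_encoder_trainable(model: nn.Module, trainable: bool) -> int:
    """(Un)freeze every parameter of the model's `vision_encoder`
    submodule; returns the number of parameters affected."""
    n = 0
    for p in model.vision_encoder.parameters():
        p.requires_grad = trainable
        n += p.numel()
    return n
```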

The training involved 27 epochs, consuming 21,500 A100-hours, with a batch size of 2048.

Experiment

Zero-Shot Evaluation

Robot Platforms

Baselines

(See Figure 3 in the paper.)

Conclusions

Fine-Tuning Evaluation

For the fine-tuning experiments, the authors use a 7-DoF Franka Emika Panda robot.

Baselines

(See the fine-tuning results figure in the paper.)

LoRA and Quantization

The authors first compare the performance of fine-tuning different subsets of the model's parameters.
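
One of the compared approaches is LoRA, which freezes the base weights and learns only a low-rank additive update per linear layer. A self-contained NumPy sketch of the forward pass (class name and initialization constants are mine; real implementations such as Hugging Face PEFT handle this per-module inside the network):

```python
import numpy as np

class LoRALinear:
    """y = x @ (W + (alpha / r) * A @ B): a frozen base weight W plus a
    trainable low-rank update, with A (d_in x r) and B (r x d_out)."""
    def __init__(self, W, r=32, alpha=32, rng=None):
        rng = rng or np.random.default_rng(0)
        d_in, d_out = W.shape
        self.W = W                                  # frozen base weight
        self.A = rng.normal(0.0, 0.02, (d_in, r))   # trainable, Gaussian init
        self.B = np.zeros((r, d_out))               # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # zero-initialised B makes the layer start exactly at the base weights
        return x @ self.W + self.scale * (x @ self.A) @ self.B
```

Because `B` starts at zero, fine-tuning begins from the pre-trained model's behavior, and only `A` and `B` (a small fraction of the full weight count) receive gradients.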

Model Precision
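
The paper also studies serving the model at reduced precision, reporting that 4-bit quantization preserves task performance while substantially cutting GPU memory. As background, a minimal NumPy sketch of symmetric per-tensor quantization (an illustration of the general idea, not the NF4 scheme used by bitsandbytes):

```python
import numpy as np

def quantize(w, n_bits=4):
    """Symmetric per-tensor quantization to signed n-bit integers
    (stored in int8 for simplicity)."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is at most scale / 2."""
    return q.astype(np.float32) * scale
```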
