XF-Blog
[Paper Note] FAST: Efficient Action Tokenization for Vision-Language-Action Models

Naive tokenization strategies based on a per-dimension, per-timestep binning scheme face a significant issue: consecutive tokens are highly correlated. This makes the next-token prediction task overly simple, often leaving models stuck in poor local optima.

The paper's key insight is that robot action signals must be compressed before training to reduce correlation between consecutive tokens. Byte Pair Encoding (BPE) alone is designed for discrete symbols; to handle continuous signals effectively, FAST first applies Discrete Cosine Transform (DCT) encoding and then compresses the resulting integer coefficients with BPE.

Previous Action Representations

Semantic Representations

Low-level Robot Control Commands

For smooth signals, a high sampling rate means changes between consecutive timesteps are minimal. This leads to a near-zero marginal information gain for each new token. Autoregressive models struggle to learn meaningful patterns because they can achieve low loss simply by copying the previous token, failing to learn complex behaviors.
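A quick numerical sketch (illustrative, not from the paper) makes this concrete: binning a smooth, densely sampled signal per timestep yields a token sequence where simply copying the previous token is already highly accurate.

```python
import numpy as np

# Smooth 1D action signal sampled at a high rate.
t = np.linspace(0, 2 * np.pi, 2000)
signal = np.sin(t)

# Naive per-timestep binning into 256 discrete tokens.
bins = np.linspace(signal.min(), signal.max(), 257)
tokens = np.clip(np.digitize(signal, bins) - 1, 0, 255)

# Fraction of tokens identical to the previous token: a trivial
# "copy the last token" baseline already predicts most tokens correctly.
copy_accuracy = np.mean(tokens[1:] == tokens[:-1])
print(f"copy-previous-token accuracy: {copy_accuracy:.2f}")
```

A model can reach low loss with this copy strategy, which is exactly the failure mode described above.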

Frequency-space Action Sequence Tokenization (FAST)

FAST uses the Discrete Cosine Transform (DCT) to convert continuous action signals into a frequency-space representation.

DCT is similar to the Fourier Transform but uses only cosine waveforms. It transforms a discrete 1D signal of length N into a vector of N coefficients, one per cosine basis frequency. These basis functions are orthogonal, and the linear combination of them weighted by the coefficients reconstructs the original signal exactly.
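A minimal sketch with SciPy's `dct` illustrates both properties: the transform is exactly invertible, and for smooth signals the energy concentrates in the lowest-frequency coefficients.

```python
import numpy as np
from scipy.fft import dct, idct

# A smooth action trajectory: one DoF over a 1-second window at 50 Hz.
t = np.linspace(0, 1, 50)
x = 0.5 * np.sin(2 * np.pi * t) + 0.2 * t

# Type-II DCT with an orthonormal basis; idct inverts it exactly.
coeffs = dct(x, norm="ortho")
x_rec = idct(coeffs, norm="ortho")
assert np.allclose(x, x_rec)

# Energy compaction: cumulative energy captured by the lowest frequencies.
energy = np.cumsum(coeffs**2) / np.sum(coeffs**2)
print(f"energy in first 5 of 50 coefficients: {energy[4]:.3f}")
```

Because almost all the energy sits in a handful of coefficients, the rest can be discarded or coarsely quantized with little reconstruction error.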

Why use this transformation? For smooth signals, most of the energy concentrates in a few low-frequency coefficients, so the signal can be represented compactly and the resulting tokens are far less redundant than per-timestep bins.

Multi-channel Handling
In a time window, there are multiple channels (e.g., joint angles, velocities). Each channel undergoes DCT individually rather than using a 2D DCT, as correlation between different channels is relatively low.

Quantization
After DCT, the real-valued coefficients are quantized into integers by scaling and rounding. The scale factor trades off reconstruction precision against compression.

Flattening
The quantized matrix is flattened into a vector by columns (column-first flattening).
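The per-channel DCT, quantization, and flattening steps above can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the `scale` value in particular is an assumed placeholder hyperparameter.

```python
import numpy as np
from scipy.fft import dct

def fast_pre_tokenize(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Sketch of FAST pre-tokenization for one action chunk.

    actions: (T, D) window of T timesteps and D channels.
    Returns a 1D integer sequence ready for BPE compression.
    """
    # 1) DCT each channel independently along the time axis
    #    (per-channel 1D DCTs, not a 2D DCT).
    coeffs = dct(actions, axis=0, norm="ortho")
    # 2) Quantize real coefficients to integers by scaling and rounding
    #    (scale is an assumed hyperparameter, not the paper's value).
    quantized = np.round(coeffs * scale).astype(np.int64)
    # 3) Column-first flatten of the (T, D) coefficient matrix.
    return quantized.flatten(order="F")

# Example: a smooth-ish 50-step window with 7 action dimensions.
rng = np.random.default_rng(0)
chunk = np.cumsum(0.01 * rng.standard_normal((50, 7)), axis=0)
tokens = fast_pre_tokenize(chunk)
print(tokens.shape)  # one integer per DCT coefficient: (350,)
```

The integer sequence produced here is what BPE subsequently compresses into the final action tokens.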

The BPE vocabulary size is set to 1024; ablations showed performance is largely insensitive to this hyperparameter.
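BPE then compresses the flattened integer sequences by repeatedly merging the most frequent adjacent pair into a new token until the vocabulary budget is reached. A minimal, illustrative merge loop (not the paper's implementation) over integer tokens:

```python
from collections import Counter

def bpe_train(sequence: list[int], num_merges: int) -> list[tuple[int, int]]:
    """Learn BPE merges over an integer token sequence (illustrative sketch)."""
    merges = []
    next_id = max(sequence) + 1  # fresh ids for merged tokens
    seq = list(sequence)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats; merging gains nothing
            break
        merges.append(best)
        # Replace every occurrence of the best pair with the new token id.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_id += 1
    return merges

seq = [1, 2, 1, 2, 3, 1, 2, 3]
print(bpe_train(seq, 2))  # the most frequent pair (1, 2) merges first
```

Frequently co-occurring coefficient patterns collapse into single tokens, which is what shortens the action sequences the VLA must predict.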

Experiments

The method was tested on two VLA backbones: π₀ and OpenVLA. The evaluation covered dexterous tasks and generalization tasks, such as manipulating unseen objects in new environments.

Comparisons

Key Findings

Ablation Study
