[Paper Note] FAST: Efficient Action Tokenization for Vision-Language-Action Models
Naive tokenization strategies based on a per-dimension, per-timestep binning scheme face a significant issue: tokens are too strongly correlated. This makes the prediction task overly simple, often leaving models stuck in poor local optima.
The key insight is that robot action signals should be compressed before training so that consecutive tokens carry less redundant information. Byte Pair Encoding (BPE) provides the compression, but BPE operates on discrete symbols, so the continuous action values are first transformed with the Discrete Cosine Transform (DCT) and quantized.
Previous Action Representations
Semantic Representations
- Language sub-tasks or keypoints.
- The main problem is the need for a manually designed mapping from semantics to specific execution commands.
Low-level Robot Control Commands
- Using an external diffusion model to generate continuous action sequences.
- Per-dimension, per-timestep binning schemes.
- Vector-quantized action representations: These methods perform well at coarse, low-fidelity reconstruction tasks but fail on high-frequency tasks where fine-grained control is required.
For smooth signals, a high sampling rate means changes between consecutive timesteps are minimal. This leads to a near-zero marginal information gain for each new token. Autoregressive models struggle to learn meaningful patterns because they can achieve low loss simply by copying the previous token, failing to learn complex behaviors.
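A minimal sketch of this effect (the trajectory, sampling rate, and 256-bin discretization below are illustrative assumptions, not the paper's exact settings): for a small, smooth motion sampled at a high rate, most consecutive naive tokens are identical and the rest change by a single bin, so an autoregressive model can reach low loss simply by copying the previous token.

```python
import numpy as np

# Illustrative smooth trajectory: a small joint motion inside a normalized
# action range of [-1, 1], sampled at 50 Hz for 1 s (all values are assumptions).
t = np.linspace(0.0, 1.0, 50)
signal = 0.02 * np.sin(2 * np.pi * 1.0 * t)

# Naive per-timestep binning: 256 uniform bins over the full action range.
bins = np.linspace(-1.0, 1.0, 257)
tokens = np.digitize(signal, bins) - 1

repeat_rate = np.mean(tokens[1:] == tokens[:-1])
max_jump = np.max(np.abs(np.diff(tokens)))
print(f"repeated consecutive tokens: {repeat_rate:.0%}, largest step-to-step change: {max_jump} bin(s)")
```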
Frequency-space Action Sequence Tokenization (FAST)
FAST uses the Discrete Cosine Transform (DCT) to convert continuous action signals into a frequency-space representation.
- Action Chunk: Continuous action signals are divided into time windows. Each window contains H timesteps and D channels of action signals.
- Time Length: Typically about 1 second. The exact control frequency matters less because the DCT represents the whole window in frequency space, regardless of how densely it is sampled.
DCT is similar to the Fourier Transform but uses only cosine basis functions. It maps a sampled 1D signal of length H to a vector of H coefficients, one per basis frequency. The basis functions are orthogonal, and their linear combination weighted by these coefficients reconstructs the original signal.
Why use this transformation?
- Low-frequency coefficients usually have large magnitudes, while high-frequency coefficients are small; after quantization, many high-frequency coefficients become exactly zero, which makes the sequence easy for BPE to compress into tokens (see the sketch after this list).
- Unlike vector quantization, the transform is analytical rather than learned, making it simple and fast.
- It is compatible with different action control frequencies.
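A minimal sketch of the energy-concentration property (the signal is synthetic and scipy's `dct` is used as the DCT implementation; both are assumptions for illustration): for a smooth 1-second window, almost all of the energy sits in the first few coefficients.

```python
import numpy as np
from scipy.fft import dct

# Synthetic smooth 1-D action signal over a 1 s window at 50 Hz (illustrative values).
t = np.linspace(0.0, 1.0, 50)
signal = 0.3 * np.sin(2 * np.pi * 1.0 * t) + 0.1 * t

coeffs = dct(signal, norm="ortho")                 # H samples in -> H coefficients out
energy = np.cumsum(coeffs ** 2) / np.sum(coeffs ** 2)

print("first 5 coefficient magnitudes:", np.round(np.abs(coeffs[:5]), 3))
print("share of energy in the first 5 of 50 coefficients:", round(float(energy[4]), 4))
```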
Multi-channel Handling
In a time window, there are multiple action channels (e.g., joint angles, velocities). Each channel undergoes a 1D DCT individually rather than a 2D DCT across channels, since the correlation between different channels is relatively low (a sketch of this step follows the Quantization list below).
- The input and output of the per-channel DCT are both D×H matrices.
Quantization
After DCT, real numbers are quantized into integers:
- The process involves scaling and then rounding.
- The scaling coefficient is a hyperparameter that balances lossiness and the compression rate.
- In experiments, this hyperparameter had little impact and was set to 10.
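A minimal sketch of the per-channel DCT and quantization steps, assuming a synthetic D=7 × H=50 chunk and scipy's `dct`; the scale of 10 follows the setting reported above.

```python
import numpy as np
from scipy.fft import dct

# Synthetic action chunk: D=7 channels x H=50 timesteps (1 s at 50 Hz), smooth per channel.
t = np.linspace(0.0, 1.0, 50)
chunk = np.stack([0.1 * (c + 1) * np.sin(2 * np.pi * 0.5 * (c + 1) * t) for c in range(7)])

# Per-channel 1-D DCT along the time axis (no 2-D DCT: cross-channel correlation is low).
coeffs = dct(chunk, axis=-1, norm="ortho")          # shape stays (D, H)

# Quantization: scale, then round to integers. A scale of 10 is the reported setting.
scale = 10.0
quantized = np.round(coeffs * scale).astype(int)

print("chunk shape:", chunk.shape, "-> quantized shape:", quantized.shape)
print("nonzero entries:", np.count_nonzero(quantized), "of", quantized.size)
```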
Flattening
The quantized matrix is flattened into a vector by columns (column-first flattening).
- This means the same frequency components from different channels are placed together.
- This proved more stable during training than row-first flattening (grouping all frequencies of a single dimension together).
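A toy illustration of the two flattening orders on a small quantized coefficient matrix (the D=3 × H=4 values are made up):

```python
import numpy as np

# Quantized DCT coefficients for a toy chunk: D=3 channels x H=4 frequencies.
quantized = np.array([[12, -3, 1, 0],
                      [ 8,  2, 0, 0],
                      [-5,  1, 0, 0]])

# Column-first flattening: the same frequency index of every channel is grouped
# together, lowest frequencies first (order="F" walks down columns).
column_first = quantized.flatten(order="F")
print(column_first)   # [12  8 -5 -3  2  1  1  0  0  0  0  0]

# Row-first flattening, for comparison: all frequencies of one channel, then the next.
row_first = quantized.flatten(order="C")
print(row_first)      # [12 -3  1  0  8  2  0  0 -5  1  0  0]
```

Column-first ordering puts the large, low-frequency coefficients of all channels at the front and trails off into zeros, which the paper found to train more stably.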
After flattening, the integer sequence is compressed with BPE. The BPE vocabulary size is set to 1024, which also showed minimal impact as a hyperparameter.
Experiments
The method was tested on two VLA backbones: π0 and OpenVLA. The evaluation covered dexterous tasks and generalization tasks, such as manipulating unseen objects in new environments.
Comparisons
- FSQ: Finite Scalar Quantization, a learned compression baseline that requires training a quantization model, unlike FAST's analytical transform.
- Naive tokenization: standard per-dimension, per-timestep binning.
Key Findings
- In high-frequency tasks, FAST shows a high compression rate, significantly reducing the number of tokens to be generated.
- FAST significantly outperforms FSQ and naive tokenization in complex and high-frequency tasks.
- On the π0 backbone, the autoregressive FAST variant was compared against the diffusion-based version.
- On small datasets, FAST and diffusion achieve similar performance.
- On large datasets, FAST converges several times faster during training, although the diffusion version is more time-efficient at inference.
Ablation Study
- FAST performs well across different autoregressive VLA models, suggesting the approach is independent of the underlying VLA backbone.
- If BPE is removed, performance drops but remains better than naive tokenization.
- The DCT transform itself concentrates the signal's information into a few coefficients and leaves many others near zero (exactly zero after quantization), which reduces redundancy and improves the learning signal.
- BPE prevents a large number of repeated zero tokens from diluting the model’s learning signal.
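A toy illustration (not the actual BPE implementation used in the paper) of why this matters: after DCT and quantization most entries are zero, and merge rules that collapse zero runs shrink the sequence dramatically, so the model's learning signal is concentrated on the informative tokens.

```python
import numpy as np

def collapse_zero_runs(tokens, max_run=8):
    """Mimic learned BPE merges by collapsing runs of zeros into single 'run' tokens."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == 0:
            run = 0
            while i < len(tokens) and tokens[i] == 0 and run < max_run:
                run += 1
                i += 1
            out.append(f"<zeros:{run}>")
        else:
            out.append(str(tokens[i]))
            i += 1
    return out

rng = np.random.default_rng(0)
# Simulated quantized, flattened DCT coefficients: mostly zeros, a few informative values.
flat = np.where(rng.random(350) < 0.85, 0, rng.integers(-20, 20, 350))
merged = collapse_zero_runs(flat.tolist())
print(len(flat), "raw tokens ->", len(merged), "tokens after zero-run merges")
```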
References
- OpenVLA: An Open-Source Vision-Language-Action Model
- π0: A Vision-Language-Action Flow Model for General Robot Control
- Finite Scalar Quantization: VQ-VAE Made Simple