[Paper Note] Genie: Generative Interactive Environments

Genie introduces a novel approach for creating generative interactive environments, demonstrated on 2D platformer games. The model is trained without supervision and can turn prompts given as text, synthetic images, or even sketches into action-controllable virtual worlds, without requiring any ground-truth action labels.

The system is trained on publicly available Internet gameplay videos. It combines a video tokenizer, a latent action model that extracts actions causally from video alone, and a dynamics model that predicts future frames. A separate model trained on robotics datasets shows that the same method carries over to generalist agents.

Method

Spatial-Temporal Transformer

All three of Genie's components are built on a memory-efficient ST-transformer. Each block interleaves spatial attention, where the tokens within a frame attend to one another, with causal temporal attention, where each spatial position attends to itself across earlier frames, followed by a single feed-forward layer after both attention passes. Compared with full spatiotemporal attention, the compute grows linearly rather than quadratically with the number of frames.
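To make the structure concrete, here is a minimal PyTorch sketch of one such block; `STBlock` and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """One ST-transformer block: spatial attention, then causal temporal attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens_per_frame, dim)
        B, T, N, D = x.shape

        # Spatial attention: tokens within each frame attend to each other.
        s = self.norm1(x).reshape(B * T, N, D)
        s, _ = self.spatial_attn(s, s, s, need_weights=False)
        x = x + s.reshape(B, T, N, D)

        # Temporal attention: each spatial position attends causally over time.
        t = self.norm2(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        t, _ = self.temporal_attn(t, t, t, attn_mask=causal, need_weights=False)
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # A single FFN after both attention passes, as described in the paper.
        return x + self.ffn(self.norm3(x))
```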

Latent Action Model

The latent action model (LAM) learns actions entirely without labels. An encoder takes the previous frames x_{1:t} and the next frame x_{t+1} and outputs a continuous latent action; a decoder must then reconstruct x_{t+1} from the frame history and that action. To keep the action space small and controllable, the latent action is quantized with a VQ-VAE-style objective into one of only 8 discrete codes. The decoder exists only to give the encoder a training signal and is discarded at inference, where the user selects discrete actions directly.

[Figure 5: latent action model]
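A hedged sketch of the quantization bottleneck follows; the module name, codebook sizes, and loss weighting are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionQuantizer(nn.Module):
    """VQ bottleneck that snaps a continuous action embedding to a discrete code."""

    def __init__(self, num_actions: int = 8, dim: int = 32):
        super().__init__()
        # A small codebook: one embedding per discrete latent action.
        self.codebook = nn.Embedding(num_actions, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, dim), the LAM encoder's continuous action embedding.
        dists = torch.cdist(z, self.codebook.weight)   # (batch, num_actions)
        idx = dists.argmin(dim=-1)                     # discrete action index
        q = self.codebook(idx)                         # quantized embedding
        # Straight-through estimator: gradients pass to the encoder as if
        # quantization were the identity.
        q_st = z + (q - z).detach()
        # Standard VQ-VAE losses: commit the encoder to the codes and
        # move the codes toward the encoder outputs.
        vq_loss = F.mse_loss(z, q.detach()) + F.mse_loss(q, z.detach())
        return q_st, idx, vq_loss
```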

Video Tokenizer

The video tokenizer is a VQ-VAE whose encoder and decoder are ST-ViViT-style transformers. It compresses raw video into discrete spatiotemporal tokens, reducing dimensionality and giving the dynamics model a compact vocabulary to predict.

Dynamics Model

The dynamics model is a decoder-only MaskGIT transformer. Given the tokenized frames z_{1:t-1} and the latent actions a_{1:t-1}, it predicts the tokens of the next frame, and is trained with a cross-entropy loss on randomly masked tokens.

[Figure 7: dynamics model]
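A minimal sketch of one masked training step for such a dynamics model, assuming a hypothetical `dynamics` callable that maps masked tokens plus latent actions to logits over the token vocabulary; all names, shapes, and the exact masking-rate range are assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID = 1024  # assumed: one id past the tokenizer's codebook


def dynamics_loss(dynamics, tokens, actions):
    # tokens:  (batch, time, tokens_per_frame) int64 ids from the video tokenizer
    # actions: (batch, time) int64 latent action ids from the LAM
    targets = tokens.clone()
    # Bernoulli masking with a per-sample rate, assumed uniform in [0.5, 1.0].
    rate = torch.empty(tokens.shape[0], 1, 1, device=tokens.device).uniform_(0.5, 1.0)
    mask = torch.rand(tokens.shape, device=tokens.device) < rate
    inputs = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = dynamics(inputs, actions)  # (batch, time, tokens_per_frame, vocab)
    # Cross-entropy computed only on the masked positions.
    return F.cross_entropy(logits[mask], targets[mask])
```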

Training Process

Stage 1: Video Tokenizer Training

The tokenizer is trained first, on its own, with a standard VQ-VAE reconstruction objective.

Stage 2: Co-training the Latent Action Model and Dynamics Model

With the tokenizer frozen, the latent action model and the dynamics model are then trained together on the same videos.

Latent Action Model: operates directly on raw pixels and is trained with a VQ-VAE-style objective, reconstructing each next frame from the history and the quantized latent action.

Dynamics Model: operates on the frozen tokenizer's tokens and is trained with a masked-token cross-entropy loss, conditioned on the latent actions inferred by the LAM.

At inference time, if the user provides no input, the latent action is set to 0.
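Putting the pieces together, here is a hedged sketch of the interactive loop at inference time; `tokenizer`, `dynamics_sample`, and `detokenize` are hypothetical interfaces standing in for the trained components, not the paper's API.

```python
import torch


def play(tokenizer, dynamics_sample, detokenize, prompt_frame, user_actions):
    tokens = [tokenizer(prompt_frame)]  # tokenize the prompt image
    frames = [prompt_frame]
    for action in user_actions:
        if action is None:
            action = 0  # no user input -> latent action 0
        history = torch.stack(tokens, dim=1)  # (batch, time, tokens_per_frame)
        next_tokens = dynamics_sample(history, action)  # MaskGIT sampling
        tokens.append(next_tokens)
        frames.append(detokenize(next_tokens))  # decode tokens back to pixels
    return frames
```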

Experiments

Genie was trained on videos of 2D platformer games: 16-second clips sampled at 10 frames per second, totaling 30,000 hours of footage processed at a resolution of 160x90 pixels.

Video Tokenizer: 200M parameters, patch size 4, and a codebook with 1024 unique codes.

Latent Action Model: 300M parameters, patch size 16, and a codebook with 8 unique codes (i.e., 8 latent actions).

A sequence length of 16 frames was used for training all components.

A batch size of 448 outperformed both 128 and 256, suggesting that, for the same number of gradient updates, seeing more data per update is better.

During inference, each frame was sampled with 25 MaskGIT steps, using random sampling with a temperature of 2.
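For reference, a minimal sketch of MaskGIT-style sampling for one frame under these settings; `logits_fn` and the cosine re-masking schedule are assumptions, not the paper's exact procedure.

```python
import math
import torch


@torch.no_grad()
def sample_frame(logits_fn, num_tokens, vocab, mask_id, steps=25, temperature=2.0):
    # Start with every token of the new frame masked.
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = logits_fn(tokens) / temperature  # (1, num_tokens, vocab)
        probs = logits.softmax(dim=-1)
        # Random sampling (rather than greedy argmax) at each position.
        sampled = torch.multinomial(probs.view(-1, vocab), 1).view(1, num_tokens)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Keep tokens fixed in earlier steps; only masked positions change.
        masked = tokens == mask_id
        sampled = torch.where(masked, sampled, tokens)
        conf = torch.where(masked, conf, torch.full_like(conf, float("inf")))
        # Cosine schedule: re-mask the least confident tokens for the next step.
        num_remask = int(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        if num_remask > 0:
            lowest = conf.topk(num_remask, largest=False).indices
            sampled[0, lowest[0]] = mask_id
        tokens = sampled
    return tokens
```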

The model demonstrated an understanding of 3D concepts and simulated parallax, with objects closer to the camera moving more than distant ones.

Ablation Study

One ablation examined whether the LAM should learn actions directly from raw pixel frames or from the tokenizer's video tokens; Genie ultimately uses raw pixels as the LAM input.

The authors also compared architectures for the video tokenizer: a spatial-only ViT, C-ViViT, and the ST-ViViT used in Genie.