[Paper Note] Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Existing Problem
While MaskGIT-like masked generative methods offer fast inference, their overall generation quality still lags behind state-of-the-art diffusion models.
Innovation
- Our method combines multi-modal and single-modal transformer layers.
- Language and vision representations are inherently different: the multi-modal layers attend jointly to text and image tokens, while the single-modal layers then refine the visual representation on its own.
- RoPE (Rotary Positional Embedding) for encoding image token positions (see the sketch after this list).
- Masking-rate conditioning informs the model how much valid (unmasked) information is available; the rate is discretized before being used as a condition.
- We use micro-conditioning, incorporating original image resolution, crop coordinates, and human preference scores.
- Feature compression layers are used.
- Our 1B model achieves performance comparable to SDXL.
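A minimal sketch of how rotary positional embedding can be applied to a 2D grid of image tokens. The row/column split of the head dimension is a common 2D-RoPE construction and is an assumption here, as are all names and sizes; this is not the Meissonic implementation.

```python
# Hedged sketch: 2D rotary positional embedding over an image token grid.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """positions: (n,) integer positions -> (n, dim/2) rotation angles."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x (n, dim) by angles (n, dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# For a 2D token grid, split each attention head's channels in half:
# one half is rotated by the row position, the other by the column position.
H = W = 32                                     # e.g. the 32x32 compressed token grid
head_dim = 64
rows = torch.arange(H).repeat_interleave(W)    # (H*W,) row index of each token
cols = torch.arange(W).repeat(H)               # (H*W,) column index of each token
q = torch.randn(H * W, head_dim)               # queries for one attention head
q_rot = torch.cat([
    apply_rope(q[:, : head_dim // 2], rope_angles(rows, head_dim // 2)),
    apply_rope(q[:, head_dim // 2 :], rope_angles(cols, head_dim // 2)),
], dim=-1)                                     # the same rotation is applied to keys
```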
Architecture
- We use CLIP as the text encoder.
- VQ-VAE image encoder and decoder (a token-lookup sketch follows this list).
- Codebook size: 8192
- Multi-modal transformer backbone.
- We chose CLIP over T5 because it’s faster.
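A minimal sketch of VQ-VAE nearest-codebook lookup with an 8192-entry codebook, which turns continuous latents into the discrete tokens the transformer predicts. The latent dimension and grid size are illustrative assumptions, not the actual Meissonic VQ-VAE.

```python
# Hedged sketch: nearest-neighbour vector quantization with an 8192-entry codebook.
import torch

codebook = torch.randn(8192, 16)           # 8192 code vectors (latent dim of 16 is assumed)
latents = torch.randn(1, 64, 64, 16)       # encoder output for one high-resolution image

# Each latent vector is mapped to the id of its nearest codebook entry.
flat = latents.reshape(-1, 16)                         # (4096, 16)
dists = torch.cdist(flat, codebook)                    # (4096, 8192) pairwise distances
token_ids = dists.argmin(dim=-1).reshape(1, 64, 64)    # discrete image tokens

# The transformer predicts these ids; the decoder maps the chosen codes back to pixels.
quantized = codebook[token_ids]                        # (1, 64, 64, 16)
```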
Feature Compression Layers
- These layers reduce the number of tokens representing an image, compressing the 64x64 latent token grid to 32x32 before it enters the transformer.
- We use 2D convolution-based feature compression layers.
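A minimal sketch of a convolutional compression/decompression pair that halves the token grid (64x64 to 32x32, i.e. 4x fewer tokens) before the transformer and restores it afterwards. Channel sizes and names are illustrative assumptions.

```python
# Hedged sketch: 2D-convolutional feature compression around the transformer.
import torch
import torch.nn as nn

class FeatureCompress(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        # Strided conv: 64x64 token grid -> 32x32.
        self.down = nn.Conv2d(dim, dim, kernel_size=2, stride=2)
        # Transposed conv restores the 64x64 grid after the transformer.
        self.up = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)

    def compress(self, x: torch.Tensor) -> torch.Tensor:    # (B, dim, 64, 64)
        return self.down(x)                                  # (B, dim, 32, 32)

    def decompress(self, x: torch.Tensor) -> torch.Tensor:  # (B, dim, 32, 32)
        return self.up(x)                                     # (B, dim, 64, 64)

fc = FeatureCompress()
tokens = torch.randn(1, 1024, 64, 64)
compressed = fc.compress(tokens)     # the transformer now sees 1024 tokens instead of 4096
restored = fc.decompress(compressed)
```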
Micro-Conditioning Details
- We use sinusoidal embeddings and concatenate these as additional channels to the final pooled hidden states of the text encoder.
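A minimal sketch of micro-conditioning: each scalar condition (original resolution, crop coordinates, human preference score) gets a sinusoidal embedding, and the result is concatenated to the pooled text embedding. Dimensions and names are assumptions.

```python
# Hedged sketch: sinusoidal embedding of scalar micro-conditions.
import math
import torch

def sinusoidal_embed(value: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal embedding of a scalar condition, value: (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half).float() / half)
    args = value.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

pooled_text = torch.randn(1, 1024)                    # pooled CLIP text embedding (assumed dim)
micro = torch.tensor([[1024., 1024., 0., 0., 6.0]])   # (orig_h, orig_w, crop_top, crop_left, hps)
micro_emb = torch.cat([sinusoidal_embed(micro[:, i]) for i in range(micro.shape[1])], dim=-1)
cond = torch.cat([pooled_text, micro_emb], dim=-1)    # conditioning vector fed to the model
```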
Masking Strategy
- During training, the masking rate is sampled from an arccos distribution.
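A minimal sketch of training-time masking. The mask ratio follows an arccos distribution (density 2 / (pi * sqrt(1 - r^2))), realized here by r = cos(pi*u/2) with u ~ U(0, 1); this parameterization, the MASK_ID value, and all names are assumptions for illustration.

```python
# Hedged sketch: sample an arccos-distributed mask ratio and mask that fraction of tokens.
import math
import torch

MASK_ID = 8192  # reserved id just past the 8192-entry codebook

def mask_tokens(token_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """token_ids: (B, N) discrete image tokens -> (masked ids, boolean mask)."""
    b, n = token_ids.shape
    u = torch.rand(b)
    ratio = torch.cos(math.pi / 2 * u)             # arccos-distributed mask ratio in (0, 1]
    num_mask = (ratio * n).clamp(min=1).long()     # tokens to mask per example
    scores = torch.rand(b, n)
    # Keep the `num_mask` positions with the highest random scores as the masked set.
    kth = scores.sort(dim=1, descending=True).values.gather(1, (num_mask - 1)[:, None])
    mask = scores >= kth
    masked = torch.where(mask, torch.full_like(token_ids, MASK_ID), token_ids)
    return masked, mask
```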
Training
- Loss: cross-entropy on the masked token positions (see the sketch after this list)
- 256 resolution, batch size (bz) = 2048, 100k steps
- 512 resolution, bz = 512, 100k steps
- 1024 resolution, bz = 256, 42k steps
- Training is very fast compared to other models: about 48 H100 GPU days in total.
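A minimal sketch of the objective: cross-entropy computed only at the masked positions. `model` stands in for the multi-modal transformer (not the actual Meissonic API), and the masked ids and boolean mask come from a masking step like the one sketched under Masking Strategy.

```python
# Hedged sketch: cross-entropy loss restricted to masked token positions.
import torch
import torch.nn.functional as F

def masked_ce_loss(model, token_ids, masked_ids, mask, text_cond):
    """token_ids: (B, N) targets; masked_ids: (B, N) inputs with [MASK]; mask: (B, N) bool."""
    logits = model(masked_ids, text_cond)                 # (B, N, vocab_size)
    return F.cross_entropy(logits[mask], token_ids[mask]) # loss only where tokens were masked
```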
Four Training Stages
- Stage 1: Focus on broad coverage of concepts
- Although LAION has quality issues, finely annotated datasets don’t comprehensively cover fundamental concepts, especially regarding human faces.
- We carefully curated the deduplicated LAION-2B dataset by filtering out images with aesthetic scores below 4.5, watermark probabilities exceeding 50%, and other criteria outlined in Kolors (2024) (a filtering sketch follows this list).
- About 2M images, 256x256 resolution
- Stage 2: Aligning images with long prompts
- Aesthetic score > 8 from LAION
- 1.2M synthetic image-text pairs with long captions.
- 512x512 resolution
- We observed a significant boost in the model’s ability to capture abstract concepts and respond accurately to complex prompts, including diverse styles and fantasy characters.
- Stage 3: Feature compression
- 1024 resolution
- By introducing feature compression layers, we achieve a seamless transition from 512x512 to 1024x1024 generation with minimal computational cost.
- Final Stage: Human preference
- We fine-tune the model using a small learning rate, without freezing the text encoder. We also incorporate human preference scores as a micro-condition.
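A minimal sketch of the Stage 1 curation rules noted above: drop images with an aesthetic score below 4.5 or a watermark probability above 50%. The metadata field names are assumptions, not actual LAION column names.

```python
# Hedged sketch: Stage 1 data filtering on LAION metadata.
def keep_sample(meta: dict) -> bool:
    if meta["aesthetic_score"] < 4.5:
        return False
    if meta["watermark_probability"] > 0.5:
        return False
    # Further criteria (e.g. resolution, NSFW score) follow Kolors (2024).
    return True

samples = [
    {"aesthetic_score": 5.2, "watermark_probability": 0.1},   # kept
    {"aesthetic_score": 4.1, "watermark_probability": 0.0},   # dropped: low aesthetics
    {"aesthetic_score": 6.0, "watermark_probability": 0.8},   # dropped: likely watermark
]
filtered = [s for s in samples if keep_sample(s)]
```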
Results
- The evaluation focuses more on visual aesthetics and human preference.
- Performance is reported on the HPS v2 benchmark.
- Results are also reported on the Multi-Dimensional Human Preference Score benchmark.