[Paper Note] Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Existing Problem
While MaskGIT-like masked generative methods offer fast inference, their overall generation quality still lags behind state-of-the-art diffusion models.
Innovation
- Our method combines multi-modal and single-modal transformer layers.
- Language and vision representations are inherently different: the multi-modal layers attend jointly to text and image tokens, while the single-modal layers then refine the visual representation on its own.
- RoPE (Rotary Positional Embedding) for encoding image token positions (see the sketch after this list).
- Masking-rate conditioning informs the model how much valid (unmasked) information is available; the rate is discretized before being used as a condition.
- We use micro-conditioning, incorporating original image resolution, crop coordinates, and human preference scores.
- Feature compression layers are used.
- Our 1B model achieves performance comparable to SDXL.
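A minimal sketch of how rotary positional embedding can be applied to a 2D grid of image tokens. The row/column split of the head dimension is a common 2D-RoPE construction and is an assumption here, as are all names and sizes; this is not the Meissonic implementation.

```python
# Hedged sketch: 2D rotary positional embedding over an image token grid.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """positions: (n,) integer positions -> (n, dim/2) rotation angles."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x (n, dim) by angles (n, dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# For a 2D token grid, split each attention head's channels in half:
# one half is rotated by the row position, the other by the column position.
H = W = 32                                     # e.g. the 32x32 compressed token grid
head_dim = 64
rows = torch.arange(H).repeat_interleave(W)    # (H*W,) row index of each token
cols = torch.arange(W).repeat(H)               # (H*W,) column index of each token
q = torch.randn(H * W, head_dim)               # queries for one attention head
q_rot = torch.cat([
    apply_rope(q[:, : head_dim // 2], rope_angles(rows, head_dim // 2)),
    apply_rope(q[:, head_dim // 2 :], rope_angles(cols, head_dim // 2)),
], dim=-1)                                     # the same rotation is applied to keys
```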
Architecture
- We use CLIP as the text encoder.
- VQ-VAE image encoder and decoder (a token-lookup sketch follows this list).
- Codebook size: 8192
- Multi-modal transformer backbone.
- We chose CLIP over T5 because it’s faster.
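A minimal sketch of VQ-VAE nearest-codebook lookup with an 8192-entry codebook, which turns continuous latents into the discrete tokens the transformer predicts. The latent dimension and grid size are illustrative assumptions, not the actual Meissonic VQ-VAE.

```python
# Hedged sketch: nearest-neighbour vector quantization with an 8192-entry codebook.
import torch

codebook = torch.randn(8192, 16)           # 8192 code vectors (latent dim of 16 is assumed)
latents = torch.randn(1, 64, 64, 16)       # encoder output for one high-resolution image

# Each latent vector is mapped to the id of its nearest codebook entry.
flat = latents.reshape(-1, 16)                         # (4096, 16)
dists = torch.cdist(flat, codebook)                    # (4096, 8192) pairwise distances
token_ids = dists.argmin(dim=-1).reshape(1, 64, 64)    # discrete image tokens

# The transformer predicts these ids; the decoder maps the chosen codes back to pixels.
quantized = codebook[token_ids]                        # (1, 64, 64, 16)
```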
Feature Compression Layers
- These layers reduce the number of tokens representing an image, compressing the 64x64 latent token grid to 32x32 before it enters the transformer.
- We use 2D convolution-based feature compression layers.
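A minimal sketch of a convolutional compression/decompression pair that halves the token grid (64x64 to 32x32, i.e. 4x fewer tokens) before the transformer and restores it afterwards. Channel sizes and names are illustrative assumptions.

```python
# Hedged sketch: 2D-convolutional feature compression around the transformer.
import torch
import torch.nn as nn

class FeatureCompress(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        # Strided conv: 64x64 token grid -> 32x32.
        self.down = nn.Conv2d(dim, dim, kernel_size=2, stride=2)
        # Transposed conv restores the 64x64 grid after the transformer.
        self.up = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)

    def compress(self, x: torch.Tensor) -> torch.Tensor:    # (B, dim, 64, 64)
        return self.down(x)                                  # (B, dim, 32, 32)

    def decompress(self, x: torch.Tensor) -> torch.Tensor:  # (B, dim, 32, 32)
        return self.up(x)                                     # (B, dim, 64, 64)

fc = FeatureCompress()
tokens = torch.randn(1, 1024, 64, 64)
compressed = fc.compress(tokens)     # the transformer now sees 1024 tokens instead of 4096
restored = fc.decompress(compressed)
```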
Micro-Conditioning Details
- We use sinusoidal embeddings and concatenate these as additional channels to the final pooled hidden states of the text encoder.
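A minimal sketch of micro-conditioning: each scalar condition (original resolution, crop coordinates, human preference score) gets a sinusoidal embedding, and the result is concatenated to the pooled text embedding. Dimensions and names are assumptions.

```python
# Hedged sketch: sinusoidal embedding of scalar micro-conditions.
import math
import torch

def sinusoidal_embed(value: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal embedding of a scalar condition, value: (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half).float() / half)
    args = value.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

pooled_text = torch.randn(1, 1024)                    # pooled CLIP text embedding (assumed dim)
micro = torch.tensor([[1024., 1024., 0., 0., 6.0]])   # (orig_h, orig_w, crop_top, crop_left, hps)
micro_emb = torch.cat([sinusoidal_embed(micro[:, i]) for i in range(micro.shape[1])], dim=-1)
cond = torch.cat([pooled_text, micro_emb], dim=-1)    # conditioning vector fed to the model
```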
Masking Strategy
- During training, the masking rate is sampled from an arccos distribution.
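A minimal sketch of training-time masking. The mask ratio follows an arccos distribution (density 2 / (pi * sqrt(1 - r^2))), realized here by r = cos(pi*u/2) with u ~ U(0, 1); this parameterization, the MASK_ID value, and all names are assumptions for illustration.

```python
# Hedged sketch: sample an arccos-distributed mask ratio and mask that fraction of tokens.
import math
import torch

MASK_ID = 8192  # reserved id just past the 8192-entry codebook

def mask_tokens(token_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """token_ids: (B, N) discrete image tokens -> (masked ids, boolean mask)."""
    b, n = token_ids.shape
    u = torch.rand(b)
    ratio = torch.cos(math.pi / 2 * u)             # arccos-distributed mask ratio in (0, 1]
    num_mask = (ratio * n).clamp(min=1).long()     # tokens to mask per example
    scores = torch.rand(b, n)
    # Keep the `num_mask` positions with the highest random scores as the masked set.
    kth = scores.sort(dim=1, descending=True).values.gather(1, (num_mask - 1)[:, None])
    mask = scores >= kth
    masked = torch.where(mask, torch.full_like(token_ids, MASK_ID), token_ids)
    return masked, mask
```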
Training
- Loss: cross-entropy on the masked token positions (see the sketch after this list)
- 256 resolution, batch size (bz) = 2048, 100k steps
- 512 resolution, bz = 512, 100k steps
- 1024 resolution, bz = 256, 42k steps
- Training is very fast compared to other models: about 48 H100 GPU days in total.
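A minimal sketch of the objective: cross-entropy computed only at the masked positions. `model` stands in for the multi-modal transformer (not the actual Meissonic API), and the masked ids and boolean mask come from a masking step like the one sketched under Masking Strategy.

```python
# Hedged sketch: cross-entropy loss restricted to masked token positions.
import torch
import torch.nn.functional as F

def masked_ce_loss(model, token_ids, masked_ids, mask, text_cond):
    """token_ids: (B, N) targets; masked_ids: (B, N) inputs with [MASK]; mask: (B, N) bool."""
    logits = model(masked_ids, text_cond)                 # (B, N, vocab_size)
    return F.cross_entropy(logits[mask], token_ids[mask]) # loss only where tokens were masked
```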
Four Training Stages
- Stage 1: Focus on broad coverage of concepts
- Although LAION has quality issues, finely annotated datasets don’t comprehensively cover fundamental concepts, especially regarding human faces.
- We carefully curated the deduplicated LAION-2B dataset by filtering out images with aesthetic scores below 4.5, watermark probabilities exceeding 50%, and other criteria outlined in Kolors (2024) (a filtering sketch follows this list).
- About 2M images, 256x256 resolution
- Stage 2: Aligning images with long prompts
- Aesthetic score > 8 from LAION
- 1.2M synthetic image-text pairs with long captions.
- 512x512 resolution
- We observed a significant boost in the model’s ability to capture abstract concepts and respond accurately to complex prompts, including diverse styles and fantasy characters.
- Stage 3: Feature compression
- 1024 resolution
- By introducing feature compression layers, we achieve a seamless transition from 512x512 to 1024x1024 generation with minimal computational cost.
- Final Stage: Human preference
- We fine-tune the model using a small learning rate, without freezing the text encoder. We also incorporate human preference scores as a micro-condition.
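A minimal sketch of the Stage 1 curation rules noted above: drop images with an aesthetic score below 4.5 or a watermark probability above 50%. The metadata field names are assumptions, not actual LAION column names.

```python
# Hedged sketch: Stage 1 data filtering on LAION metadata.
def keep_sample(meta: dict) -> bool:
    if meta["aesthetic_score"] < 4.5:
        return False
    if meta["watermark_probability"] > 0.5:
        return False
    # Further criteria (e.g. resolution, NSFW score) follow Kolors (2024).
    return True

samples = [
    {"aesthetic_score": 5.2, "watermark_probability": 0.1},   # kept
    {"aesthetic_score": 4.1, "watermark_probability": 0.0},   # dropped: low aesthetics
    {"aesthetic_score": 6.0, "watermark_probability": 0.8},   # dropped: likely watermark
]
filtered = [s for s in samples if keep_sample(s)]
```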
Results
- The evaluation focuses more on visual aesthetics and human preference.
- Performance is reported on the HPS v2 benchmark.
- Results are also reported on the Multi-Dimensional Human Preference Score benchmark.