XF-Blog
Paper Notes: Machine Learning
[Paper Note] PARC: Physics-based Augmentation with Reinforcement Learning for Character Controllers
The goal of this paper is to enable characters to move flexibly across complex terrain while mitigating issues such as incorrect contacts and discontinuities. Current methods usually rely heavily on motion capture data, w...
[Paper Note] 3D Gaussian Splatting for Real-Time Radiance Field Rendering
3D Gaussian Splatting (3DGS) aims to solve the problem of expensive neural network training and inference in NeRF, enabling real-time rendering. It avoids the random sampling that NeRF relies on, significantly accelerating rendering ...
[Paper Note] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
The core idea is to represent a 3D scene with a multilayer perceptron (MLP): the network takes a 3D position and a viewing direction as input and outputs the corresponding color and volume density. Given the ...
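To make the input-output interface concrete, here is a minimal sketch of such an MLP in PyTorch; the layer sizes, frequency counts, and the `TinyNeRF` name are illustrative assumptions, not the paper's exact eight-layer architecture:

```python
# Sketch of a NeRF-style MLP (assumed sizes, not the paper's exact network):
# 3D position and view direction in, RGB color and volume density out.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """Map each coordinate to [x, sin(2^k x), cos(2^k x)] features."""
    feats = [x]
    for k in range(num_freqs):
        feats += [torch.sin(2.0 ** k * x), torch.cos(2.0 ** k * x)]
    return torch.cat(feats, dim=-1)

class TinyNeRF(nn.Module):
    POS_FREQS, DIR_FREQS = 10, 4

    def __init__(self, hidden=256):
        super().__init__()
        pos_dim = 3 * (1 + 2 * self.POS_FREQS)
        dir_dim = 3 * (1 + 2 * self.DIR_FREQS)
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)   # volume density sigma
        self.color_head = nn.Sequential(           # view-dependent RGB
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        h = self.trunk(positional_encoding(xyz, self.POS_FREQS))
        sigma = torch.relu(self.density_head(h))   # density is non-negative
        rgb = self.color_head(
            torch.cat([h, positional_encoding(view_dir, self.DIR_FREQS)], dim=-1))
        return rgb, sigma

rgb, sigma = TinyNeRF()(torch.randn(4, 3), torch.randn(4, 3))
```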
[Paper Note] 3D-VLA: A 3D Vision-Language-Action Generative World Model
This work focuses on integrating 3D perception into robotic systems to enhance their understanding of 3D environments. Main contributions: previous Vision-Language-Action (VLA) models primarily relied on 2D image input and performed inference...
[Paper Note] DETR: End-to-End Object Detection with Transformers
https://arxiv.org/abs/2005.12872 Traditional object detection methods usually require manual post-processing steps to produce the final bounding boxes; for example, non-maximum suppression (NMS) is commonly used to remove duplicate boxes. Furthermore, traditional methods often rely on an initial guess...
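As a reminder of the kind of hand-written post-processing DETR's set-prediction formulation removes, here is a hedged sketch of greedy NMS over axis-aligned (x1, y1, x2, y2) boxes; the threshold and helper functions are my own, not from the paper:

```python
# Greedy non-maximum suppression: keep the highest-scoring box, drop
# overlapping ones, repeat until no boxes remain.
import numpy as np

def iou(box, boxes):
    """Intersection-over-union of one box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]           # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep
```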
[Paper Note] OpenVLA: An Open-Source Vision-Language-Action Model
This paper presents a new paradigm for robot control: leveraging Vision-Language Models (VLMs) pre-trained on extensive robot data. The goal is to fine-tune these pre-trained VLMs rather than train new behaviors from scratch. This approach is called...
[Paper Note] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Standard VLA models typically map language and vision directly to robot actions. Predicting the future can serve as a reasoning process that improves VLA capabilities; common methods include predicting future video frames...
[Paper Note] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Given the existence of powerful large language models (LLMs) and image encoders, a natural progression is to integrate them. To prevent catastrophic forgetting, the LLM is often frozen, which can make tr...
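A minimal sketch of that frozen-backbone setup, with tiny stand-in modules rather than BLIP-2's actual Q-Former: both pretrained networks are frozen, and only a small bridging module receives gradients:

```python
# Freeze the pretrained image encoder and LLM; train only a small bridge.
# The modules below are trivial placeholders, not BLIP-2's real components.
import torch.nn as nn

image_encoder = nn.Sequential(nn.Linear(768, 768))  # stand-in for a pretrained ViT
llm = nn.Sequential(nn.Linear(768, 768))            # stand-in for a pretrained LLM

for module in (image_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False                     # frozen: no catastrophic forgetting

bridge = nn.TransformerEncoder(                     # the only trainable part
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
trainable = [p for p in bridge.parameters() if p.requires_grad]
```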
[Paper Note] TEDi: Temporally-Entangled Diffusion for Long-Term Motion Synthesis
The goal of TEDi is to address the challenges of long-term motion generation. Traditional methods often suffer from inconsistencies when stitching diffusion-generated clips together, or they face significant latency issues. TEDi int...
[Paper Note] π0: A Vision-Language-Action Flow Model for General Robot Control
Key ideas: integrate Vision-Language Model (VLM) visual-semantic knowledge into robot control; use flow matching to represent continuous, high-frequency action chunks, which achieves fine-grained and coherent action generation; ut...
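As a hedged illustration of the flow-matching idea (the names, shapes, and linear interpolation path below are assumptions, not π0's actual implementation), a training step can look like this:

```python
# Flow-matching training step for action chunks: the network learns the
# velocity field that transports Gaussian noise to real action chunks.
import torch
import torch.nn as nn

action_dim, horizon = 7, 16                     # assumed: 7-DoF actions, 16-step chunk
net = nn.Sequential(                            # stand-in for the conditioned policy
    nn.Linear(action_dim * horizon + 1, 256), nn.ReLU(),
    nn.Linear(256, action_dim * horizon),
)

def flow_matching_loss(actions):
    """actions: (batch, horizon * action_dim) flattened chunks."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)         # random interpolation time in [0, 1]
    x_t = (1 - t) * noise + t * actions         # straight-line path: noise -> data
    target_velocity = actions - noise           # constant velocity along that path
    pred = net(torch.cat([x_t, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

loss = flow_matching_loss(torch.randn(32, action_dim * horizon))
loss.backward()
```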
[Paper Note] FAST: Efficient Action Tokenization for Vision-Language-Action Models
Naive tokenization strategies based on a per-dimension, per-timestep binning scheme face a significant issue: tokens are too strongly correlated. This makes the prediction task overly simple, often leaving models stuck in poor lo...
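To show what such a naive tokenizer looks like, here is a sketch (the bin count and action range are assumptions): each action dimension at each timestep is discretized independently into one token:

```python
# Per-dimension, per-timestep binning: the baseline scheme the paper criticizes.
import numpy as np

def binning_tokenize(actions, num_bins=256, low=-1.0, high=1.0):
    """actions: (timesteps, action_dim) in [low, high] -> flat token ids."""
    clipped = np.clip(actions, low, high)
    bins = ((clipped - low) / (high - low) * (num_bins - 1)).round().astype(int)
    return bins.flatten()                  # one token per (timestep, dim) pair

tokens = binning_tokenize(np.random.uniform(-1, 1, size=(16, 7)))
print(tokens.shape)                        # (112,) = 16 steps x 7 dims
```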
[Paper Note] Genie: Generative Interactive Environments
Genie introduces a novel approach for creating Generative Interactive Environments, specifically targeting platformer games. This system is unsupervised and can generate action-controllable virtual worlds described through text, synthetic images, or even sketches...
[Paper Note] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
We hypothesize that video diffusion models can capture dynamic information and better predict the future physical world. This capability could provide valuable guidance for robot action learning. For generalis...
[Paper Note] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Open-set object detection refers to the task of detecting objects of unknown categories, not just those from a predefined set; for example, identifying a lion’s ear. This field is advanced by intr...
[Paper Note] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
https://arxiv.org/abs/2411.05007 https://github.com/mit-han-lab/nunchaku A quantization method where both weights and activations are quantized to 4 bits. Activation outliers are first migrated into the weights using a smoothing method; this, however, makes the weight outliers more pronounced. SVD is then used to d...
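The decomposition idea can be sketched as follows (the rank, bit-width, and absolute-max quantizer are illustrative assumptions, not SVDQuant's actual kernels): keep a low-rank component in high precision and quantize only the residual:

```python
# Low-rank + quantized-residual split of a weight matrix (assumed rank/bits).
import numpy as np

def lowrank_plus_quant(W, rank=16, bits=4):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * S[:rank]) @ Vt[:rank]    # low-rank part absorbs outliers
    R = W - L                                   # residual has a tighter range
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(R).max() / levels
    R_q = np.round(R / scale).clip(-levels - 1, levels)  # 4-bit signed codes
    return L, R_q, scale

W = np.random.randn(256, 256)
L, R_q, scale = lowrank_plus_quant(W)
print(np.abs(W - (L + R_q * scale)).mean())     # mean reconstruction error
```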
[Paper Note] The Super Weight in Large Language Models
https://arxiv.org/abs/2411.07191 Outliers with large magnitudes in LLMs significantly impact model performance: pruning even a single parameter can drastically increase perplexity, yet the most important outliers account for less than 0.01% of the total parameters. These super weights can be identified with just one forward pass...
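As a hedged sketch of what one-forward-pass identification can look like (a generic activation-spike scan on a toy model, not the paper's exact procedure):

```python
# Hook every Linear layer, run a single forward pass, and record where the
# largest-magnitude activation occurs. Toy model; layer names are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

records = []
def make_hook(name):
    def hook(module, inputs, output):
        records.append((name, output.abs().max().item()))
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

model(torch.randn(1, 64))                    # one forward pass is enough
layer, magnitude = max(records, key=lambda r: r[1])
print(f"largest activation spike: layer {layer}, |value| = {magnitude:.2f}")
```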
[Paper Note] Titans: Learning to Memorize at Test Time
https://arxiv.org/abs/2501.00663 Attention mechanisms and recurrent models each have strengths and weaknesses: attention can attend to the entire context window, but at a high computational cost; recurrent models compress the state into a fixed size, but struggle to model dependencies...
[Paper Note] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://arxiv.org/abs/2106.06103 Two encoders are used: one generates a latent variable from the original speech spectrogram, and the other generates one from the text. These latents should be as similar as possible. To address the length mismatch between text and speech, an unsupervised...
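A hedged sketch of that two-encoder objective in a simplified CVAE view (not VITS's full architecture; dimensions and module names are assumptions): each encoder predicts a diagonal Gaussian, and a KL term pushes the two distributions together:

```python
# Posterior encoder (spectrogram -> latent) and prior encoder (text -> latent),
# aligned with a KL divergence between diagonal Gaussians.
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim, latent_dim=16):
        super().__init__()
        self.net = nn.Linear(in_dim, 2 * latent_dim)  # predicts mean and log-variance
    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        return mu, logvar

posterior = GaussianEncoder(in_dim=80)    # 80-bin mel frames (assumed)
prior = GaussianEncoder(in_dim=128)       # text embedding dim (assumed)

def kl_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, averaged over the batch."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                  - 1).sum(-1).mean()

loss = kl_gaussians(*posterior(torch.randn(4, 80)), *prior(torch.randn(4, 128)))
```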
[Paper Note] Emerging Properties in Self-Supervised Vision Transformers
https://arxiv.org/abs/2104.14294 DINO introduces a self-supervised method that requires neither labels nor negative samples. It uses a teacher network and a student network, updating the teacher's parameters through an exponential moving average (EMA) of the student's. The teacher network sees a global view, while the student network only...
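The EMA update itself is a one-liner per parameter; a minimal sketch (the momentum value and backbone are illustrative):

```python
# The teacher is never trained by gradients; it trails the student via EMA.
import copy
import torch
import torch.nn as nn

student = nn.Linear(128, 64)                 # stand-in for the ViT backbone
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False                  # gradients flow only to the student

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

ema_update(teacher, student)                 # called once per training step
```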
[Paper Note] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
https://arxiv.org/abs/2006.16236 This work addresses the quadratic complexity of Transformers. Self-attention can be expressed as a linear dot-product of kernel feature maps, achieving O(N) complexity. When a kernel with positive similarity scores is applied to the queries and keys, linear attention...
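A hedged sketch of that computation with the elu(x) + 1 feature map the paper uses (tensor shapes are (batch, sequence, dim); the einsum layout is my own): associativity of the matrix product is what drops the cost from O(N^2) to O(N):

```python
# Non-causal linear attention: compute K^T V once, then apply each query to it.
import torch
import torch.nn.functional as F

def feature_map(x):
    return F.elu(x) + 1                          # positive similarity scores

def linear_attention(Q, K, V, eps=1e-6):
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = torch.einsum("bnd,bne->bde", Kf, V)     # sum over keys once: O(N)
    Z = 1 / (torch.einsum("bnd,bd->bn", Qf, Kf.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", Qf, KV, Z)

Q = K = V = torch.randn(2, 100, 32)
print(linear_attention(Q, K, V).shape)           # torch.Size([2, 100, 32])
```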
[Paper Note] Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Existing problem: while MaskGIT-like methods offer fast inference, their overall performance isn’t great. Innovation: our method uses a combination of multi-modal and single-modal transformer layers. Language and vision representations are inherently different, so we use cross-modal transformers to und...
[Paper Note] MaskGIT: Masked Generative Image Transformer
https://arxiv.org/abs/2202.04200 Unlike text, images are not sequential, which makes auto-regressive models ill-suited to image generation tasks. During training, MaskGIT is trained on a masked prediction task similar to the one used in BERT. Inference: at each iteration, the model predicts a...
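A hedged sketch of that iterative decoding loop (the keep-the-confident-tokens rule and the cosine masking schedule follow the paper's description, but `model` here is a random stand-in and the details are simplified):

```python
# Start fully masked; each step samples all positions, keeps the most
# confident predictions, and re-masks a shrinking fraction of the rest.
import math
import torch

def iterative_decode(model, seq_len, mask_id, steps=8):
    tokens = torch.full((1, seq_len), mask_id)
    for step in range(steps):
        probs = model(tokens).softmax(-1)            # (1, seq_len, vocab)
        sampled = torch.multinomial(probs[0], 1).T   # one draw per position
        confidence = probs[0].gather(-1, sampled.T).T
        still_masked = tokens == mask_id
        confidence[~still_masked] = math.inf         # never re-mask fixed tokens
        tokens = torch.where(still_masked, sampled, tokens)
        # cosine schedule: the masked fraction shrinks to zero by the last step
        num_to_mask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if num_to_mask > 0:
            lowest = confidence.topk(num_to_mask, largest=False).indices
            tokens[0, lowest[0]] = mask_id
    return tokens

model = lambda t: torch.randn(t.shape[0], t.shape[1], 512)  # stand-in transformer
print(iterative_decode(model, seq_len=16, mask_id=512))
```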
[Paper Note] Learning Transferable Visual Models From Natural Language Supervision
CLIP maps images and text into the same embedding space and is trained with contrastive learning: within each batch, the similarity between matched image-text pairs is maximized, while the similarity between mismatched pairs is minimized. Both the image and text encoders are trained from scratch...
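The symmetric contrastive objective is compact enough to sketch (close in spirit to the pseudocode in the CLIP paper, but the encoders below are trivial stand-ins and the fixed temperature is an assumption; CLIP learns it):

```python
# Symmetric InfoNCE loss over the batch similarity matrix: matched pairs sit
# on the diagonal and are treated as the correct "class" in both directions.
import torch
import torch.nn.functional as F

image_encoder = torch.nn.Linear(2048, 512)   # stand-in image encoder
text_encoder = torch.nn.Linear(768, 512)     # stand-in text encoder

def clip_loss(images, texts, temperature=0.07):
    img = F.normalize(image_encoder(images), dim=-1)
    txt = F.normalize(text_encoder(texts), dim=-1)
    logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
    labels = torch.arange(len(images))       # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels)          # image -> text direction
            + F.cross_entropy(logits.T, labels)) / 2  # text -> image direction

loss = clip_loss(torch.randn(8, 2048), torch.randn(8, 768))
```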
[Paper Note] Language Models are Unsupervised Multitask Learners
This paper presents a general-purpose training procedure that can be applied to a variety of NLP tasks, using task instructions and task input as conditioning factors. A model trained on a massive, diverse, and unsupervised dataset can handle many tasks in a zero-shot manner and typically outperforms...