XF-Blog
Recent
[Paper Note] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
https://arxiv.org/abs/2411.05007 https://github.com/mit-han-lab/nunchaku A quantization method in which both weights and activations are quantized to 4 bits. Activation outliers are first processed using a smoothing method. However, this leads to more pronounced weight outliers. SVD is then used to d... Read more
[Paper Note] The Super Weight in Large Language Models
https://arxiv.org/abs/2411.07191 Outliers with large magnitudes in LLMs significantly impact model performance. Pruning even a single such parameter can cause a dramatic boost in perplexity. The most important outliers make up less than 0.01% of the total parameters. These super weights can be identified with just one forward pa... Read more
[Paper Note] Titans: Learning to Memorize at Test Time
https://arxiv.org/abs/2501.00663 Attention mechanisms and recurrent models each have their strengths and weaknesses: Attention can attend to the entire context window, but it comes with a high computational cost. Recurrent models compress the state into a fixed size, but they struggle to model depe... Read more
[Paper Note] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://arxiv.org/abs/2106.06103 Two encoders are used: one to generate a latent variable from the original speech spectrogram, and another to generate one from the text. These variables should be as similar as possible. To address the problem of variable length between text and speech, an unsupe... Read more
[Paper Note] Emerging Properties in Self-Supervised Vision Transformers
https://arxiv.org/abs/2104.14294 DINO introduces a self-supervised method that doesn’t require labels or negative samples. It uses a teacher network and a student network, updating the teacher network’s parameters through EMA. The teacher network sees a global view, while the student network only... Read more
Project
Preliminary Experiment for LLM Distillation and Pretraining
This experiment verifies the effectiveness of various methods from recent papers.
This is a preliminary experiment on pretraining a language model and using distillation to accelerate training and improve performance. The experiment verifies the effectiveness of the following methods: DeepNet (https://arxiv.org/pdf/2203.00555.pdf) Distillation framework, and corresponding los... Read more
Are Small Language Models Low-rank?
Explore causal LM training and increase hidden dimension with low-rank matrices.
Over-parameterized language models are intrinsically low-rank. In this project, I trained two causal language models with 28M parameters each, one as the baseline and the other using low-rank weights but a higher hidden dimension, and compared their training speed and accuracy. Although they ... Read more
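Below is a minimal sketch of the kind of factorized layer the low-rank variant refers to, assuming PyTorch; the dimensions, rank, and class name are illustrative choices, not the exact configuration used in the project.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """A linear layer factorized as W ≈ B @ A with rank r << min(d_in, d_out).

    The parameter count drops from d_in * d_out to r * (d_in + d_out), which is
    what frees budget to enlarge the hidden dimension at the same model size.
    """
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)   # project down to rank r
        self.B = nn.Linear(rank, d_out, bias=True)   # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))

# A full-rank 512x512 layer holds ~262k weights; a rank-64 factorization of a
# wider 768x768 layer holds ~99k, so the hidden size can grow within budget.
layer = LowRankLinear(768, 768, rank=64)
print(sum(p.numel() for p in layer.parameters()))
```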
Revealing Category Preferences of ResNet Layers: Visualization Based on Web
Using web-based technology, this project visualizes the internal activation pattern of ResNet18 on the CIFAR10 dataset.
Using web-based technology, this project visualizes the internal activation pattern of ResNet18 on the CIFAR10 dataset. The visualization tries to show whether the kernels in the convolutional layers tend to specialize in a certain class and, at a higher level, whether there exists an internal activation ... Read more
An Evaluation of Four P300 ERP Classifiers' Generalization Performance in the Oddball Paradigm
P300 ERP is evoked when a person perceives a target stimulus, and it is associated with the decision-making process signaling that something important has occurred.
Classifying the P300 event-related potential usually requires prior knowledge about the EEG signal during target and non-target stimuli. However, different classifiers need different amounts of data to achieve a usable classification ability. In this final project, I explored 4 different classifier... Read more
Use Verlet Integration to Simulate Gravity
Finite Difference, Verlet Integration, and its Application
Online demo: https://xiaonanfu-ucsd.github.io/verlet-gravity/ Differential equations are an important tool in classical mechanics, for example for analyzing forces and motion. When simulating the physical behavior of real-world objects, numerical differentiation is usually good enough to revea... Read more
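As a rough illustration of the post's topic, here is a tiny position-Verlet step for a body under constant gravity; the time step, duration, and starting conditions are made-up values, and this is not code from the linked demo.

```python
# Position Verlet: x(t+dt) = 2*x(t) - x(t-dt) + a(t)*dt^2
# Velocity never appears explicitly; it is implied by the last two positions.

def simulate_fall(x0: float = 100.0, dt: float = 0.01, steps: int = 100, g: float = -9.81):
    x_prev = x0              # position one step in the past (body starts at rest)
    x = x0                   # current position
    trajectory = [x0]
    for _ in range(steps):
        x_next = 2 * x - x_prev + g * dt * dt
        x_prev, x = x, x_next
        trajectory.append(x)
    return trajectory

# After 1 s of free fall the analytic height is 100 - 0.5*9.81 ≈ 95.1 m;
# the Verlet result lands close to that despite the crude start-at-rest init.
print(simulate_fall()[-1])
```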
Machine Learning
[Paper Note] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
https://arxiv.org/abs/2411.05007 https://github.com/mit-han-lab/nunchaku A quantization method in which both weights and activations are quantized to 4 bits. Activation outliers are first processed using a smoothing method. However, this leads to more pronounced weight outliers. SVD is then used to d... Read more
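A rough numpy sketch of the decomposition idea described above: split a (smoothed) weight into a low-rank branch kept in higher precision plus a residual quantized to 4 bits. The rank, the symmetric rounding, and the per-tensor scale are simplifying assumptions, not the paper's actual kernel design.

```python
import numpy as np

def svd_lowrank_plus_int4(W: np.ndarray, rank: int = 32):
    """Split W into L (low-rank, full precision) + a dequantized 4-bit residual."""
    # The low-rank branch absorbs the dominant singular directions, where the
    # outliers migrated into the weights by smoothing tend to concentrate.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    # The residual is better conditioned, so 4-bit quantization hurts it less.
    R = W - L
    scale = np.abs(R).max() / 7.0                       # symmetric int4 range [-7, 7]
    Q = np.clip(np.round(R / scale), -7, 7).astype(np.int8)
    return L, Q, scale

W = np.random.randn(256, 256).astype(np.float32)
L, Q, scale = svd_lowrank_plus_int4(W)
W_hat = L + Q.astype(np.float32) * scale
print(np.abs(W - W_hat).max())   # reconstruction error of the quantized path
```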
[Paper Note] The Super Weight in Large Language Models
https://arxiv.org/abs/2411.07191 Outliers with large magnitudes in LLMs significantly impact model performance. Pruning even a single such parameter can cause a dramatic boost in perplexity. The most important outliers make up less than 0.01% of the total parameters. These super weights can be identified with just one forward pa... Read more
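A hedged sketch of what a "one forward pass" detection could look like: hook candidate layers and record where activation magnitudes spike. The `down_proj` name filter follows common Llama-style module naming and is an assumption; this is not the authors' released script.

```python
import torch

def find_activation_spikes(model, input_ids, name_filter: str = "down_proj", top_k: int = 5):
    """Run one forward pass and report the largest-magnitude activations flowing
    into the filtered layers, as candidate super-weight locations."""
    records, hooks = [], []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach()
            val, idx = x.abs().flatten().max(dim=0)
            records.append((val.item(), name, int(idx)))
        return hook

    for name, module in model.named_modules():
        if name_filter in name:
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(input_ids)
    for h in hooks:
        h.remove()
    # Layers whose input magnitude dwarfs the others point at the super weight.
    return sorted(records, reverse=True)[:top_k]
```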
[Paper Note] Titans: Learning to Memorize at Test Time
https://arxiv.org/abs/2501.00663 Attention mechanisms and recurrent models each have their strengths and weaknesses: Attention can attend to the entire context window, but it comes with a high computational cost. Recurrent models compress the state into a fixed size, but they struggle to model depe... Read more
[Paper Note] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://arxiv.org/abs/2106.06103 Two encoders are used: one to generate a latent variable from the original speech spectrogram, and another to generate one from the text. These variables should be as similar as possible. To address the problem of variable length between text and speech, an unsupe... Read more
[Paper Note] Emerging Properties in Self-Supervised Vision Transformers
https://arxiv.org/abs/2104.14294 DINO introduces a self-supervised method that doesn’t require labels or negative samples. It uses a teacher network and a student network, updating the teacher network’s parameters through EMA. The teacher network sees a global view, while the student network only... Read more
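For reference, the EMA teacher update mentioned above can be written in a few lines of PyTorch; the momentum value here is an illustrative constant rather than DINO's scheduled one.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.996):
    """teacher <- m * teacher + (1 - m) * student, applied parameter-wise.
    The teacher receives no gradients; it only follows this moving average."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```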
[Paper Note] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
https://arxiv.org/abs/2006.16236 This work aims to address the quadratic complexity issue in Transformers. Self-attention can be expressed as a linear dot-product of kernel feature maps, achieving O(N) complexity. When applying a kernel with positive similarity scores on the queries and keys, linear a... Read more
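A compact sketch of the linearized attention described in the note, using the elu(x) + 1 feature map on queries and keys; it shows the non-causal case with unbatched tensors for brevity.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Non-causal linear attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1), which is O(N) in sequence length."""
    phi_q = F.elu(q) + 1                           # positive feature map
    phi_k = F.elu(k) + 1
    kv = torch.einsum("nd,ne->de", phi_k, v)       # d x e summary, independent of N
    z = phi_q @ phi_k.sum(dim=0)                   # per-query normalizer
    return (phi_q @ kv) / (z.unsqueeze(-1) + eps)

q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)             # torch.Size([1024, 64])
```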
[Paper Note] Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Existing problem: While MaskGIT-like methods offer fast inference, their overall performance isn't great. Innovation: Our method uses a combination of multi-modal and single-modal transformer layers. Language and vision representations are inherently different. We use cross-modal transformers to und... Read more
[Paper Note] MaskGIT: Masked Generative Image Transformer
https://arxiv.org/abs/2202.04200 Unlike text, images are not sequential. This makes auto-regressive models unsuitable for image generation tasks. During training, MaskGIT is trained on a masked prediction task, similar to what is used in BERT. Inference: At each iteration, the model predicts a... Read more
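A simplified sketch of the iterative decoding loop described above: predict every masked position in parallel, keep only the most confident predictions, and re-mask the rest on a cosine schedule. The `predict_logits` callable, mask id, and step count are placeholders, not MaskGIT's actual interface.

```python
import torch

def maskgit_decode(predict_logits, seq_len: int = 256, steps: int = 8, mask_id: int = -1):
    """Iterative parallel decoding: fill every masked position, keep only the
    most confident predictions, and re-mask the rest on a cosine schedule."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = predict_logits(tokens)                        # (seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        conf = torch.where(tokens == mask_id, conf, torch.ones_like(conf))
        # Fraction of positions that should remain masked after this step.
        frac = torch.cos(torch.tensor((step + 1) / steps * torch.pi / 2))
        n_remask = int(seq_len * frac)
        if n_remask > 0:
            pred[conf.argsort()[:n_remask]] = mask_id          # drop least confident
        tokens = torch.where(tokens == mask_id, pred, tokens)  # keep fixed tokens
    return tokens
```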
[Paper Note] Learning Transferable Visual Models From Natural Language Supervision
CLIP maps images and text into the same embedding space. It is trained using contrastive learning. Within each batch, the similarity between correct image and text feature pairs is maximized, while the similarity between incorrect pairs is minimized. Both the image and text encoders are trained from... Read more
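A minimal sketch of the symmetric contrastive objective described above, assuming precomputed image and text embeddings; the fixed temperature stands in for CLIP's learned logit scale.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    """Maximize similarity for matching (image, text) pairs within a batch and
    minimize it for all other pairings, symmetrically in both directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)  # diagonal = matches
    loss_i2t = F.cross_entropy(logits, targets)                # image -> its text
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> its image
    return (loss_i2t + loss_t2i) / 2

img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```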
[Paper Note] Language Models are Unsupervised Multitask Learners
A general-purpose training procedure that can be applied to a variety of NLP tasks in a zero-shot manner.
This paper presents a general-purpose training procedure that can be applied to a variety of NLP tasks, using task instructions and task input as conditioning factors. A model trained with a massive, diverse, and unsupervised dataset can handle many tasks in a zero-shot manner and typically outperfo... Read more
Development
Migrate Ubuntu on Btrfs to a New Disk
System: Kubuntu 22.04. There are two partitions on the original disk: EFI and the Ubuntu root partition. The Ubuntu root partition uses the Btrfs filesystem, with two subvolumes, @ and @home. Use sudo fdisk /dev/[disk] to partition; for example, sudo fdisk /dev/nvme1n1. Create a GPT partition table. Cre... Read more