[Paper Note] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://arxiv.org/abs/2106.06103
TL;DR
- Two encoders are used: one to generate a latent variable z from the original speech spectrogram, and another to generate z from the text. These z variables should be as similar as possible.
- Because the text-side and speech-side sequences have different lengths, an unsupervised method is used to estimate the correspondence between phoneme tokens and the latent frames derived from the spectrogram.
- A decoder pθ(x∣z) transforms the latent variable z into a waveform. Instead of a diffusion-based method, a flow-based method is used to transform the simple prior distribution of z into a more complex one.
- Adversarial training is introduced to improve the diversity and naturalness of the generated speech.
Previous Work
Previous text-to-speech (TTS) systems typically involved two stages:
- Preprocessing: This stage converts the input text into an intermediate speech representation, such as a mel-spectrogram. A mel-spectrogram decomposes a time-domain signal into its frequency components and scales them according to human ear sensitivity.
- Waveform generation: In the second stage, raw waveforms are generated from the intermediate representation.
Autoregressive TTS methods exist, but they cannot generate in parallel, which limits synthesis speed.
Some previous end-to-end works use a special spectrogram loss to handle length differences between the target and generated speech, but their speech quality is limited.
Approach
In summary, the architecture consists of two encoders and one decoder:
- Posterior Encoder: Generates a latent variable z from the original speech.
- Prior Encoder: Generates a latent variable z from the text.
- Because the z vectors produced by the two encoders have different lengths, an alignment matrix is needed to represent the correspondence between them. This alignment matrix is learned in an unsupervised manner.
- KL divergence is used to encourage the distributions of z produced by the posterior and prior encoders to be similar.
- Decoder: Transforms the latent variable z into a waveform.
- Instead of using diffusion between the encoder and decoder, a normalizing flow is employed to transform a simple N(μ,σ) distribution into a more complex one.
- VITS can be formulated as a conditional VAE.
- The optimization objective is the Evidence Lower Bound (ELBO):
$\log p_\theta(x \mid c) \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z) - \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}\right]$
- This expression bounds pθ(x∣c), which plays the same role as pθ(x) in ordinary variational inference: the likelihood of generating a particular speech sample x.
- pθ(x∣c) can be understood as the distribution of x given condition c, marginalized over all possible z. This is a marginal likelihood:
- $p_\theta(x \mid c) = \int p_\theta(x \mid z)\, p_\theta(z \mid c)\, dz$
- Since this integral is intractable, we optimize the ELBO instead (a short derivation of the bound follows this list).
- c is a conditional variable, representing the text corresponding to a given speech sample x.
- z is a latent-space variable, i.e. the encoded representation vector.
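As a sanity check on where this bound comes from, here is a minimal derivation via Jensen's inequality, using the same notation as above (the expectation is taken under the approximate posterior qϕ(z∣x)):

```latex
\log p_\theta(x \mid c)
  = \log \int p_\theta(x \mid z)\, p_\theta(z \mid c)\, dz
  = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p_\theta(z \mid c)}{q_\phi(z \mid x)}\right]
  \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z) - \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}\right]
```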
The right-hand side of the inequality consists of two terms:
1. Reconstruction Loss
- The reconstruction term, logpθ(x∣z), is the log-probability of generating the original input x from z under the generative model. A higher value indicates that the model reconstructs x accurately.
- Computing pθ(x∣z) in its probabilistic form is difficult. Therefore, we approximate it using an L1 loss. This is similar to comparing two images pixel by pixel, rather than maximizing the probability of the second image. Using L1 instead of L2 makes the model less sensitive to outliers.
- To make the loss better match human perception, the reconstruction loss is computed on mel-spectrograms, even though the posterior encoder takes a linear spectrogram as input and the decoder outputs a raw waveform.
- Linear spectrograms (not raw waveforms) are used as input to provide more information to the posterior encoder, while the decoder still outputs a raw waveform. A sketch of the mel reconstruction loss follows this list.
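A minimal PyTorch sketch of such a mel-spectrogram reconstruction loss. The sampling rate, FFT size, hop length, and number of mel bins below are illustrative placeholders, not necessarily the paper's exact configuration:

```python
import torch
import torchaudio

# Mel-spectrogram transform; the parameters are illustrative placeholders.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

def mel_reconstruction_loss(generated_wav: torch.Tensor, target_wav: torch.Tensor) -> torch.Tensor:
    """L1 distance between the mel-spectrograms of the generated and target waveforms."""
    mel_gen = mel_transform(generated_wav)
    mel_tgt = mel_transform(target_wav)
    return torch.nn.functional.l1_loss(mel_gen, mel_tgt)
```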
2. KL-Divergence
- The term $\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}$ encourages the latent variable z to follow the prior distribution pθ(z∣c) defined by the text. KL divergence is used to measure the difference between the two distributions.
- c consists of phonemes ctext extracted from the text and an alignment matrix A with dimensions ∣ctext∣×∣z∣, where ∣z∣ is the sequence length of z.
- A is a hard monotonic attention matrix: its entries are either 0 or 1, each text token maps to one or more z frames, and each z frame is assigned to exactly one text token.
- The monotonic constraint ensures that the correspondence maintains the order of phonemes.
$L_{kl} = \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c_{\text{text}}, A)} = \log q_\phi(z \mid x) - \log p_\theta(z \mid c_{\text{text}}, A), \quad z \sim q_\phi(z \mid x) = N(z;\, \mu_\phi(x),\, \sigma_\phi(x))$
- We use neural networks to estimate μϕ(x) and σϕ(x), which represent the mean and variance, respectively.
- There are two encoders: a prior encoder that models p(z∣ctext,A), generating z from the text, and a posterior encoder that models q(z∣x), generating z from the original speech.
- To improve the expressiveness of the prior distribution, we apply a normalizing flow fθ that transforms a simple normal distribution into a more complex one. A single-sample sketch of the KL term is given after this list.
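A single-sample sketch of this KL term, treating both the posterior and the (flow-transformed) prior as diagonal Gaussians parameterized by means and log-standard-deviations. The tensor names are illustrative, and the flow's log-determinant contribution is omitted for brevity:

```python
import math
import torch

def kl_term(z, m_q, logs_q, m_p, logs_p):
    """One-sample estimate of log q_phi(z|x) - log p_theta(z|c_text, A).
    z is a sample from the posterior N(m_q, exp(logs_q)); the prior is
    N(m_p, exp(logs_p)). All tensors share the same shape."""
    log_q = -logs_q - 0.5 * math.log(2 * math.pi) \
            - 0.5 * (z - m_q) ** 2 * torch.exp(-2.0 * logs_q)
    log_p = -logs_p - 0.5 * math.log(2 * math.pi) \
            - 0.5 * (z - m_p) ** 2 * torch.exp(-2.0 * logs_p)
    return (log_q - log_p).mean()
```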
Alignment Estimation
- A text token can be mapped to multiple z tokens, so an alignment matrix is needed to represent this correspondence. However, there is no ground-truth alignment available.
- Monotonic Alignment Search (MAS) is used, as described in the paper “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search.”
- MAS searches for the alignment between the text and z that maximizes the likelihood, which in our setting means maximizing the ELBO. A dynamic-programming sketch follows this list.
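A NumPy sketch of the MAS dynamic program, assuming a precomputed matrix of log-likelihoods log N(z_j; μ_i, σ_i) for every (text token i, latent frame j) pair. This is an illustrative reimplementation of the idea, not the paper's code:

```python
import numpy as np

def monotonic_alignment_search(log_likelihood: np.ndarray) -> np.ndarray:
    """Find the hard monotonic alignment maximizing the total log-likelihood.
    log_likelihood[i, j] = log N(z_j; mu_i, sigma_i) for text token i, frame j.
    Returns a 0/1 matrix of the same shape (each frame assigned to one token)."""
    n_text, n_frames = log_likelihood.shape
    Q = np.full((n_text, n_frames), -np.inf)
    Q[0, 0] = log_likelihood[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):      # token index can never exceed frame index
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_likelihood[i, j]
    # Backtrack from the last token at the last frame.
    alignment = np.zeros_like(log_likelihood)
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[i, j] = 1.0
        if i > 0 and (j == i or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    return alignment
```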
Duration Prediction
- Because the speaking rate of humans varies, it’s not possible to map a token to a fixed duration.
- Therefore, we need a stochastic duration predictor to predict phoneme duration.
- This predictor is a flow-based generative model.
- One challenge is that duration is a discrete value (number of frames), while flow-based generative models require continuous values.
- This is addressed using variational dequantization, as described in “Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design” (see the sketch after this list).
- Another challenge is that duration is a scalar, not a high-dimensional vector.
- This is resolved using variational data augmentation, as described in “VFlow: More Expressive Generative Flows with Variational Data Augmentation.”
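A minimal sketch of the dequantization and augmentation idea. Flow++ and VFlow learn the added noise/auxiliary variables with variational posteriors; here they are simply drawn from fixed distributions to illustrate the shape manipulation:

```python
import torch

def prepare_duration_for_flow(d_int: torch.Tensor) -> torch.Tensor:
    """Turn integer frame counts into a continuous, higher-dimensional input for a flow.
    Illustrative only: the paper learns the noise/auxiliary distributions variationally."""
    u = torch.rand_like(d_int.float())           # dequantization noise in [0, 1)
    d_cont = d_int.float() + u                   # continuous value strictly between d and d+1
    aux = torch.randn_like(d_cont)               # auxiliary channel (variational data augmentation)
    return torch.stack([d_cont, aux], dim=-1)    # (..., 2) input for the flow
```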
Adversarial Training
- A discriminator compares generated speech and the ground truth waveform y.
- In addition to the standard adversarial loss, a layer-wise feature-matching L1 loss is introduced on the discriminator: for the ground truth y and the generated output, the feature maps at each discriminator layer are pushed to be similar (see the sketches after this list).
- The advantage of this approach is that it not only compares the final waveform but also encourages the model to produce similar semantic information and other features.
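Sketches of the two discriminator-related losses described above, in the least-squares GAN style; function and variable names are illustrative:

```python
import torch

def feature_matching_loss(fmaps_real, fmaps_fake):
    """L1 distance between discriminator feature maps of real and generated audio,
    summed over layers. fmaps_* are lists of tensors, one per discriminator layer."""
    loss = 0.0
    for f_real, f_fake in zip(fmaps_real, fmaps_fake):
        loss = loss + torch.nn.functional.l1_loss(f_fake, f_real.detach())
    return loss

def lsgan_losses(d_real, d_fake):
    """Least-squares adversarial objectives on the discriminator outputs."""
    d_loss = torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)  # train discriminator
    g_loss = torch.mean((d_fake - 1.0) ** 2)                            # train generator
    return d_loss, g_loss
```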
Summary of loss functions
- reconstruction loss: L1 loss
- KL divergence loss: pushes the speech-side posterior and the text-side prior over z to match
- duration loss: L1 loss
- adversarial loss: least-squares (L2) GAN loss
- discriminator feature map matching loss: L1 loss
Architecture
Posterior Encoder
- Purpose: Generates a latent variable z (a sequence of vectors) from the input speech.
- Structure: Dilated CNN with gated activation units and skip connections.
- A speaker embedding is added as global conditioning (a simplified sketch of one such block follows this list).
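A simplified sketch of one dilated, gated residual block with speaker conditioning; layer sizes and kernel widths are placeholders rather than the exact VITS module:

```python
import torch
import torch.nn as nn

class GatedDilatedBlock(nn.Module):
    """WaveNet-style block: dilated conv + gated tanh/sigmoid activation,
    with a speaker embedding injected as global conditioning."""
    def __init__(self, channels: int, speaker_dim: int, dilation: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=5,
                              dilation=dilation, padding=2 * dilation)
        self.cond = nn.Conv1d(speaker_dim, 2 * channels, kernel_size=1)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); g: (batch, speaker_dim, 1) speaker embedding
        h = self.conv(x) + self.cond(g)           # broadcast the speaker condition over time
        a, b = h.chunk(2, dim=1)
        h = torch.tanh(a) * torch.sigmoid(b)      # gated activation unit
        return x + self.res(h)                    # residual connection
```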
Prior Encoder
- Purpose: Encodes phonemes extracted from text.
- Structure: A Transformer encoder followed by a linear layer to predict the mean and variance of the prior distribution.
- To increase the complexity of the normal distribution, affine coupling layers are stacked on top.
- We add a speaker embedding to the residual blocks in the normalizing flow through global conditioning. A minimal sketch of an affine coupling layer is given below.
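A minimal sketch of a generic affine coupling layer, the building block of such a flow; it is invertible by construction because only half of the channels are transformed at a time. Channel counts and kernel sizes are illustrative:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Half of the channels are rescaled/shifted using parameters predicted
    from the other half, keeping the Jacobian triangular and the map invertible."""
    def __init__(self, channels: int, hidden: int = 192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=5, padding=2),
        )

    def forward(self, x: torch.Tensor):
        x_a, x_b = x.chunk(2, dim=1)
        log_s, t = self.net(x_a).chunk(2, dim=1)
        y_b = x_b * torch.exp(log_s) + t           # affine transform of the second half
        logdet = log_s.sum(dim=[1, 2])             # log-determinant of the Jacobian
        return torch.cat([x_a, y_b], dim=1), logdet

    def inverse(self, y: torch.Tensor):
        y_a, y_b = y.chunk(2, dim=1)
        log_s, t = self.net(y_a).chunk(2, dim=1)
        x_b = (y_b - t) * torch.exp(-log_s)
        return torch.cat([y_a, x_b], dim=1)
```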
Decoder
- Purpose: Transforms the latent variable z into a speech signal.
- Structure: Uses the HiFi-GAN V1 generator (a CNN-based architecture).
- To generate speech with different timbres, a speaker embedding is added to z at the decoder input. A heavily simplified generator sketch follows this list.
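A heavily simplified sketch of a HiFi-GAN-style generator: z (with the speaker embedding already added) is upsampled to the waveform rate by transposed convolutions interleaved with convolutional blocks. The channel counts and upsampling rates here are illustrative, not the V1 configuration:

```python
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Simplified upsampling generator mapping latent frames to a waveform."""
    def __init__(self, z_channels: int = 192):
        super().__init__()
        layers, ch = [], z_channels
        for rate in (8, 8, 2, 2):                  # illustrative upsampling factors (256x total)
            layers += [
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * rate,
                                   stride=rate, padding=rate // 2),
                nn.LeakyReLU(0.1),
                nn.Conv1d(ch // 2, ch // 2, kernel_size=3, padding=1),
            ]
            ch //= 2
        layers += [nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # z: (batch, z_channels, frames) -> (batch, 1, samples)
        return self.net(z)
```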
Discriminator
- The discriminator architecture is based on the multi-period discriminator proposed in HiFi-GAN.
Stochastic Duration Predictor
- Purpose: Predicts the distribution of phoneme duration based on the text.
- Structure: CNN and Neural Spline Flows.
- Neural Spline Flows can transform a Gaussian distribution into a more complex one and offer stronger expressiveness than affine transformations.
- Speaker embedding is added to the input.
Experiments
Datasets
- LJ Speech dataset: Used for comparison with other methods.
- VCTK dataset: A multi-speaker dataset, used to verify that the model can capture and express diverse speech characteristics.
Data Processing
- Short-time Fourier transform (STFT) is used to extract linear-scale spectrograms (see the sketch after this list).
- Text is converted to phonemes using the International Phonetic Alphabet.
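A small sketch of linear-spectrogram extraction with torch.stft; the FFT size, hop length, and window length are placeholders, not necessarily the paper's exact values:

```python
import torch

def linear_spectrogram(wav: torch.Tensor, n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Magnitude (linear-scale) spectrogram of a waveform via the STFT."""
    window = torch.hann_window(n_fft, device=wav.device)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, win_length=n_fft,
                      window=window, return_complex=True)
    return spec.abs()   # shape: (..., n_fft // 2 + 1, frames)
```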
Training Details
- Optimizer: AdamW, with β1=0.8, β2=0.99, and weight decay = 0.01 (a setup sketch follows this list).
- Learning rate: Initial value of 2×10−4, reduced every epoch.
- Short audio clips are used during training to reduce memory usage.
- Training duration: 800k steps with a batch size of 256.
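A sketch of the optimizer and scheduler setup described above; the model is a stand-in and the per-epoch decay factor is a placeholder (only the betas, weight decay, and initial learning rate come from the note):

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in for the generator/discriminator networks
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.8, 0.99), weight_decay=0.01)
# Exponential decay applied once per epoch; gamma here is a placeholder value.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
```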
Results
Evaluation Metrics
- Naturalness: Human evaluators rate the naturalness of the generated speech. The volume of all samples is normalized.
- Duration Diversity
- Generation Speed
The naturalness of models trained on multi-speaker datasets is also evaluated.
- To test the effectiveness of the stochastic duration predictor, the same sentence is generated multiple times, and the lengths of the generated audio are compared. The results demonstrate significant duration diversity.
- The model also learns the different duration patterns associated with different speakers.
Thoughts
Why use adversarial training?
- The reconstruction loss in VAEs (L1 loss) tends to minimize the average error, which can lead to a loss of detail in the generated output.
- GANs can encourage the model to generate more diverse samples, avoiding overly “averaged” or blurry results.