[Paper Note] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

https://arxiv.org/abs/2106.06103

TL;DR

  • Two encoders are used: one generates a latent variable $z$ from the original speech spectrogram, and the other generates $z$ from the text. The two resulting $z$ distributions should be as similar as possible.
  • To address the length mismatch between the text-side and speech-side latent sequences, an unsupervised method estimates the correspondence between phoneme latents and spectrogram latents.
  • A decoder $p_{\theta}(x|z)$ transforms the latent variable $z$ into a waveform. Instead of a diffusion-based method, a flow-based method transforms the simple distribution of $z$ into a more complex one (a minimal coupling-layer sketch follows this list).
  • Adversarial training is introduced to improve the diversity and naturalness of the generated speech.
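
As a concrete illustration of the flow idea, here is a minimal affine-coupling sketch in PyTorch. This is my own toy example of the general technique, not the paper's actual flow module (VITS stacks affine coupling layers on the prior side); all names and sizes are placeholders.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible affine coupling layer (illustration only, not VITS's exact module)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        # small network predicting per-dimension scale and shift for the second half
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, (dim - self.half) * 2),
        )

    def forward(self, z):
        # keep z1 fixed, transform z2 conditioned on z1 -- trivially invertible
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=-1)
        y2 = z2 * torch.exp(log_s) + t      # inverse: z2 = (y2 - t) * exp(-log_s)
        log_det = log_s.sum(-1)             # change-of-variables log-determinant
        return torch.cat([z1, y2], dim=-1), log_det

# push simple N(0, I) samples through two coupling layers
# (real flows also permute/flip dimensions between layers so every dim gets transformed)
z = torch.randn(8, 16)
for layer in [AffineCoupling(16), AffineCoupling(16)]:
    z, _ = layer(z)
```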

Previous Work

Previous text-to-speech (TTS) systems typically involved two stages:

  • A first-stage model that generates an intermediate representation (e.g., a mel-spectrogram) from the text.
  • A second-stage vocoder that synthesizes a waveform from the intermediate representation.

Autoregressive TTS methods exist, but they cannot generate in parallel, which makes inference slow.

Some previous end-to-end works use specialized spectrogram loss functions to handle the length difference between the target and generated speech. However, these methods remain limited in speech quality.

Approach

In summary, the architecture consists of two encoders and one decoder:

  • Posterior Encoder: generates a latent variable $z$ from the original speech.
  • Prior Encoder: generates a latent variable $z$ from the text.
  • Because the latent sequences produced by the two encoders have different lengths, an alignment matrix is needed to represent the correspondence between them. This alignment matrix is learned in an unsupervised manner (a simplified search sketch follows this list).
  • KL divergence is used to encourage the distributions of $z$ produced by the posterior and prior encoders to be similar.
  • Decoder: transforms the latent variable $z$ into a waveform.
  • Instead of using diffusion between the encoder and decoder, a normalizing flow is employed to transform a simple $N(\mu, \sigma)$ distribution into a more complex one.
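
The unsupervised alignment is found with Monotonic Alignment Search (MAS, adopted from Glow-TTS): a dynamic program that selects the monotonic path through the (frame, token) grid maximizing the likelihood of the speech latents under the text-side prior. A simplified NumPy sketch, assuming a precomputed log-likelihood matrix `log_p`:

```python
import numpy as np

def monotonic_alignment_search(log_p: np.ndarray) -> np.ndarray:
    """MAS-style dynamic program (simplified, hard alignment, no gradients).
    log_p[t, s] = log-likelihood of speech latent frame t under the prior
    of text token s; shape (T, S) with T >= S."""
    T, S = log_p.shape
    Q = np.full((T, S), -np.inf)
    Q[0, 0] = log_p[0, 0]                      # path must start at the first token
    for t in range(1, T):
        for s in range(S):
            # monotonic: each frame either stays on the same token or advances by one
            stay = Q[t - 1, s]
            move = Q[t - 1, s - 1] if s > 0 else -np.inf
            Q[t, s] = log_p[t, s] + max(stay, move)
    # backtrack the best monotonic path from the last frame / last token
    align = np.zeros(T, dtype=int)
    s = S - 1
    for t in range(T - 1, -1, -1):
        align[t] = s
        if t > 0 and s > 0 and Q[t - 1, s - 1] >= Q[t - 1, s]:
            s -= 1
    return align  # align[t] = index of the text token assigned to frame t
```

Counting how many frames each token receives along this path yields the durations that supervise the duration predictor.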

VAE Formulation

\log{p_{\theta}(x|c)} \geq \mathbb{E}_{q_{\phi}(z|x)}\Big[\log{p_{\theta}(x|z)} - \log{\frac{q_{\phi}(z|x)}{p_{\theta}(z|c)}}\Big]

The right-hand side of the inequality consists of two terms:

1. Reconstruction Loss: in practice, an L1 loss between the mel-spectrogram of the target speech and that of the generated waveform.

2. KL-Divergence

L_{kl} = \log{\frac{q_{\phi}(z|x)}{p_{\theta}(z|c_{text},A)}} = \log{q_{\phi}(z|x)} - \log{p_{\theta}(z|c_{text},A)}, \\
z \sim q_{\phi}(z|x) = N(z;\mu_{\phi}(x),\sigma_{\phi}(x))
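
Because both densities are Gaussian (here I assume a plain Gaussian prior for simplicity; the paper additionally passes the prior through a normalizing flow), the KL term can be estimated from a single reparameterized sample of $z$, exactly as the formula above suggests. A minimal PyTorch sketch; the function and variable names are my own:

```python
import torch

def kl_loss(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Single-sample Monte Carlo estimate of
    L_kl = log q_phi(z|x) - log p_theta(z|c_text, A), both Gaussian."""
    # reparameterized sample z ~ q_phi(z|x) = N(mu_q, sigma_q)
    z = mu_q + torch.exp(log_sigma_q) * torch.randn_like(mu_q)
    # Gaussian log-densities; the 0.5*log(2*pi) constants cancel in the difference
    log_q = -log_sigma_q - 0.5 * ((z - mu_q) * torch.exp(-log_sigma_q)) ** 2
    log_p = -log_sigma_p - 0.5 * ((z - mu_p) * torch.exp(-log_sigma_p)) ** 2
    return (log_q - log_p).sum(-1).mean()
```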

Alignment Estimation

Duration Prediction

Adversarial Training

Summary of loss functions

Architecture

Posterior Encoder

Prior Encoder

Decoder

Discriminator

Stochastic Duration Predictor

Experiments

Datasets

Data Processing

Training Details

Results

Evaluation Metrics

The naturalness of models trained on multi-speaker datasets is also evaluated.

Thoughts

Why use adversarial training?

  • The reconstruction loss in VAEs (L1 loss) tends to minimize the average error, which can lead to a loss of detail in the generated output.
  • GANs can encourage the model to generate sharper, more diverse samples, avoiding overly “averaged” or blurry results (a minimal sketch of the adversarial losses follows this list).
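
For reference, VITS adopts the HiFi-GAN discriminators with a least-squares GAN objective (plus a feature-matching loss, omitted here). A minimal sketch of the two adversarial losses, with placeholder names:

```python
import torch

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # least-squares GAN: push outputs on real audio toward 1, on generated audio toward 0
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # the generator is rewarded when the discriminator scores its audio as real (1)
    return ((d_fake - 1) ** 2).mean()
```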

Related papers