[1912.01219v2] WaveFlow: A Compact Flow-based Model for Raw Audio
Because the model is forced to learn all possible modes within the real data, its performance can be very good when it has enough model capacity.

In this work, we propose WaveFlow, a small-footprint generative flow for raw audio that is trained directly with maximum likelihood. WaveFlow handles the long-range structure of waveforms with a dilated 2-D convolutional architecture, while modeling the local variations using compact autoregressive functions. It provides a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow as special cases. WaveFlow can generate speech with fidelity comparable to WaveNet, while synthesizing several orders of magnitude faster, since it requires only a few sequential steps to generate waveforms with hundreds of thousands of timesteps. Furthermore, it can close the significant likelihood gap that has existed between autoregressive models and flow-based models geared toward efficient synthesis. Finally, our small-footprint WaveFlow has 15× fewer parameters than WaveGlow and can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time on a V100 GPU without engineered inference kernels. In other words, the model can produce waveform samples at a rate of 939.3 kHz. Audio samples are available at: https://waveflow-demo.github.io/.

Figure 1: The Jacobian ∂f⁻¹(x)/∂x of (a) an autoregressive transformation, and (b) a bipartite transformation. The blank cells are zeros and represent the independence of z_i and x_j. The light-blue cells are scaling variables and represent the linear dependencies between z_i and x_i. The dark-blue cells represent complex non-linear dependencies defined by neural networks.

Figure 2: The receptive fields over the squeezed inputs X for computing Z_{i,j} in (a) WaveFlow, (b) WaveGlow, (c) an autoregressive flow with column-major order (e.g., WaveNet), and (d) an autoregressive flow with row-major order.
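The Jacobian structure in Figure 1 can be sketched numerically. The snippet below is a toy illustration (all function names and the linear "conditioners" are our own stand-ins, not the paper's WaveNet-style networks): an autoregressive affine transform z_i = σ_i(x_{<i})·x_i + μ_i(x_{<i}) yields a triangular Jacobian as in Figure 1(a), while a bipartite (coupling) transform leaves the first half of the input unchanged, giving the block structure of Figure 1(b).

```python
import numpy as np

def autoregressive_transform(x, W_s, W_m):
    """Toy autoregressive affine flow. W_s and W_m are strictly
    lower-triangular, so the scale and shift for z_i depend only
    on x_{<i} (linear stand-ins for neural conditioners)."""
    s = np.exp(W_s @ x)   # scales sigma_i(x_{<i})
    m = W_m @ x           # shifts mu_i(x_{<i})
    return s * x + m

def bipartite_transform(x, W):
    """Toy bipartite coupling: z_a = x_a, z_b = sigma(x_a)*x_b + mu(x_a)."""
    a, b = np.split(x, 2)
    s, m = np.exp(W @ a), W @ a
    return np.concatenate([a, s * b + m])

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian df/dx."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        d = np.zeros(n); d[j] = eps
        J[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
n = 6
x = rng.standard_normal(n)
W_s = np.tril(0.1 * rng.standard_normal((n, n)), k=-1)
W_m = np.tril(rng.standard_normal((n, n)), k=-1)
W = 0.1 * rng.standard_normal((n // 2, n // 2))

# Autoregressive: Jacobian is lower-triangular (Fig. 1a), so its
# log-determinant is just the sum of the log diagonal scales.
J = numerical_jacobian(lambda v: autoregressive_transform(v, W_s, W_m), x)
assert np.allclose(J, np.tril(J), atol=1e-4)

# Bipartite: upper-left block is the identity and the upper-right
# block is zero, since z_a is independent of x_b (Fig. 1b).
Jb = numerical_jacobian(lambda v: bipartite_transform(v, W), x)
assert np.allclose(Jb[: n // 2, : n // 2], np.eye(n // 2), atol=1e-4)
assert np.allclose(Jb[: n // 2, n // 2 :], 0.0, atol=1e-4)
```

In both cases the determinant is cheap to compute (a product of diagonal scales), which is what makes exact maximum-likelihood training of such flows tractable.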