[1806.04096] Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models

Abstract: This study investigates the use of non-linear unsupervised dimensionality
reduction techniques to compress a music dataset into a low-dimensional
representation which can be used in turn for the synthesis of new sounds. We
systematically compare (shallow) autoencoders (AEs), deep autoencoders (DAEs),
recurrent autoencoders (with Long Short-Term Memory cells -- LSTM-AEs) and
variational autoencoders (VAEs) with principal component analysis (PCA) for
representing the high-resolution short-term magnitude spectrum of a large and
dense dataset of music notes into a lower-dimensional vector (and then convert
it back to a magnitude spectrum used for sound resynthesis). Our experiments
were conducted on the publicly available multi-instrument and multi-pitch
database NSynth. Interestingly and contrary to the recent literature on image
processing, we can show that PCA systematically outperforms shallow AE. Only
deep and recurrent architectures (DAEs and LSTM-AEs) lead to a lower
reconstruction error. Since the optimization criterion in VAEs is the sum of the
reconstruction error and a regularization term, VAEs naturally yield a higher
reconstruction error than DAEs, but we show that VAEs are still able to
outperform PCA while providing a low-dimensional latent space with nice
"usability" properties. We also provide corresponding objective measures of
perceptual audio quality (PEMO-Q scores), which generally correlate well with
the reconstruction error.
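As a concrete illustration of the analysis-resynthesis pipeline the abstract describes, the sketch below runs PCA over a set of magnitude-spectrum frames: it projects each frame onto a low-dimensional latent vector and decodes it back, then measures the reconstruction RMSE. This is a hypothetical minimal example on random data, not the paper's code; the frame count, bin count, and latent dimension are illustrative placeholders.

```python
import numpy as np

# Hypothetical example: PCA-based analysis-resynthesis of short-term
# magnitude spectra (random stand-in data, not the NSynth dataset).
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((500, 257)))  # 500 frames, 257 frequency bins

# Fit PCA: center the data and keep the top-k principal components.
k = 16
mean = X.mean(axis=0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k]                       # k x 257 projection matrix

Z = Xc @ W.T                     # analysis: encode to k-dim latent vectors
X_hat = Z @ W + mean             # resynthesis: decode back to spectra

# Reconstruction error (the paper reports RMSE in dB on log spectra).
rmse = np.sqrt(np.mean((X - X_hat) ** 2))
```

Any autoencoder in the comparison plays the same role as the `W`/`W.T` pair here, with the linear projection replaced by learned non-linear encoder and decoder networks.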

Figure 1: Global diagram of the sound analysis-transformation-synthesis process. (Introduction)

Figure 2: General architecture of a (deep) autoencoder. (Dimensionality Reduction Techniques)

Figure 3: Reconstruction error (RMSE in dB) obtained with PCA, AE, DAE (with and without layer-wise training) and LSTM-AE, as a function of latent space dimension. (Experimental Results for Analysis-Resynthesis)

Figure 4: Reconstruction error (RMSE in dB) obtained with VAEs as a function of latent space dimension (RMSE obtained with PCA is also recalled). (Experimental Results for Analysis-Resynthesis)

Figure 5: PEMO-Q measures obtained with PCA, AE, DAEs (with and without layer-wise training) and LSTM-AE, as a function of latent space dimension. (Experimental Results for Analysis-Resynthesis)

Figure 6: PEMO-Q measures obtained with VAEs as a function of latent space dimension (measures obtained with PCA are also recalled). (Experimental Results for Analysis-Resynthesis)

Figure 7: Correlation matrices of the latent dimensions (average absolute correlation coefficients) for PCA, DAE, LSTM-AE and VAEs. (Decorrelation of the Latent Dimensions)

Figure 8: Examples of decoded magnitude spectrograms after sound interpolation of 2 samples (top) in the latent space using respectively PCA (2nd row), LSTM-AE (3rd row) and VAE (bottom). A more detailed version of the figure can be found at https://goo.gl/Tvvb9e. (Examples of Sound Interpolation)
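The abstract notes that the VAE objective is the sum of a reconstruction error and a regularization term, which explains its higher reconstruction error relative to a plain DAE. A minimal sketch of that criterion, assuming a Gaussian encoder with a standard-normal prior and a mean-squared-error reconstruction term (the exact terms and weighting used in the paper may differ):

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Hypothetical VAE criterion: reconstruction error + KL regularization."""
    # Reconstruction term: mean squared error between input and decoded spectrum.
    recon = np.mean((x - x_hat) ** 2)
    # KL divergence between q(z|x) = N(mu, sigma^2) and the N(0, I) prior.
    kl = -0.5 * np.mean(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + beta * kl
```

The `beta` weight trades reconstruction accuracy against the regularity of the latent space; setting `beta=0` recovers a plain (deterministic) autoencoder objective.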