[1910.11269] Towards Fine-Grained Prosody Control for Voice Conversion
Subjective evaluations show that the proposed method can achieve both high naturalness and high speaker similarity in challenging situations, such as when the source speech is singing.
Abstract: In a typical voice conversion system, prior works utilize various acoustic
features (e.g., the pitch, voiced/unvoiced flag, aperiodicity) of the source
speech to control the prosody of the generated waveform. However, prosody is
related to many factors, such as intonation, stress, and rhythm. It is a
challenging task to perfectly describe the prosody through acoustic features.
To deal with this problem, we propose prosody embeddings to model prosody.
These embeddings are learned from the source speech in an unsupervised manner.
We conduct experiments on our Mandarin corpus recorded by professional speakers.
Experimental results demonstrate that the proposed method enables fine-grained
control of the prosody. In challenging situations (such as when the source
speech is singing), our proposed method can also achieve promising results.
Fig. 1. The framework of the baseline system. (Introduction)
Fig. 2. The framework of the proposed system. (Limitations)
Fig. 3. The prosody reference encoder module. A 6-layer stack of 2D convolutions with ReLU activations, followed by a single-layer GRU with 1 unit and a tanh activation. (Framework Overview)
Fig. 4. MOS test results for three types of conversion tasks. (Experimental Setup)
Fig. 5. Similarity test results of the target speaker for three types of conversion tasks. (Subjective Evaluation)
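The Fig. 3 caption gives the layer count of the prosody reference encoder but not its hyperparameters. As a rough sketch only, the snippet below traces how a spectrogram's shape might change through six 2D convolutions before reaching the GRU. The kernel size (3), stride (2), padding (1), channel widths, and input size are all assumptions chosen for illustration; none of them come from the paper.

```python
def conv_out_len(n, kernel=3, stride=2, padding=1):
    """Output length along one axis of a convolution (standard floor formula).

    kernel, stride, and padding are assumed values, not taken from the paper.
    """
    return (n + 2 * padding - kernel) // stride + 1


def reference_encoder_shapes(n_frames, n_bins,
                             channels=(32, 32, 64, 64, 128, 128)):
    """Return (channels, time, frequency) after each of the six conv layers.

    The channel widths are hypothetical; the caption only states that there
    are six 2D convolution layers with ReLU activations.
    """
    shapes = []
    t, f = n_frames, n_bins
    for c in channels:
        t, f = conv_out_len(t), conv_out_len(f)
        shapes.append((c, t, f))
    return shapes


# Hypothetical input: 400 frames of an 80-bin spectrogram.
shapes = reference_encoder_shapes(n_frames=400, n_bins=80)
c, t, f = shapes[-1]

# The GRU would consume one vector per remaining time step, obtained by
# flattening the channel and frequency axes.
gru_input_size = c * f

for i, s in enumerate(shapes, start=1):
    print(f"conv layer {i}: channels={s[0]}, time={s[1]}, freq={s[2]}")
print("GRU input size per step:", gru_input_size)
```

With a stride of 2 the time axis shrinks at every layer, which trades temporal resolution for a smaller GRU input. A system aiming at fine-grained prosody control might well keep the time stride at 1 instead, so treat the stride here as a placeholder.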