[1910.10815] Low-frequency compensated synthetic impulse responses for improved far-field speech recognition
In addition, we train a baseline model using clean speech and an oracle model using real IRs with the same distribution as the test set.
We propose a method for generating low-frequency compensated synthetic impulse responses that improve the performance of far-field speech recognition systems trained on artificially augmented datasets. We design linear-phase filters that adapt the simulated impulse responses to equalization distributions corresponding to real-world captured impulse responses. Our filtered synthetic impulse responses are then used to augment clean speech data from the LibriSpeech dataset. We evaluate the performance of our method on the real-world LibriSpeech test set. In practice, our low-frequency compensated synthetic dataset can reduce the word error rate by up to 8.8% for far-field speech recognition.
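The compensation step described above can be illustrated with a minimal sketch. It assumes a frequency-sampling design for an odd-length (Type I) linear-phase FIR filter, which is a stand-in technique and not necessarily the filter design used in the paper. The function names (`interp_db`, `design_linear_phase_fir`, `convolve`), the EQ sample points, the gains, the sample rate, and the toy impulse response are all hypothetical values chosen for illustration.

```python
import math


def interp_db(freq_hz, eq_freqs_hz, eq_gains_db):
    """Piecewise-linear interpolation of EQ gains in dB, clamped at both ends.

    Assumes eq_freqs_hz is strictly increasing.
    """
    if freq_hz <= eq_freqs_hz[0]:
        return eq_gains_db[0]
    if freq_hz >= eq_freqs_hz[-1]:
        return eq_gains_db[-1]
    for i in range(1, len(eq_freqs_hz)):
        f0, f1 = eq_freqs_hz[i - 1], eq_freqs_hz[i]
        if freq_hz <= f1:
            t = (freq_hz - f0) / (f1 - f0)
            return eq_gains_db[i - 1] + t * (eq_gains_db[i] - eq_gains_db[i - 1])


def design_linear_phase_fir(eq_freqs_hz, eq_gains_db, sample_rate, num_taps):
    """Frequency-sampling design of a symmetric (linear-phase) FIR filter."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd for a Type I linear-phase filter")
    half = (num_taps - 1) // 2
    # Target linear amplitude at each DFT bin from DC up to just below Nyquist.
    amps = [
        10 ** (interp_db(k * sample_rate / num_taps, eq_freqs_hz, eq_gains_db) / 20)
        for k in range(half + 1)
    ]
    taps = []
    for n in range(num_taps):
        # Inverse DFT of a real, symmetric amplitude response delayed by `half`.
        acc = amps[0]
        for k in range(1, half + 1):
            acc += 2 * amps[k] * math.cos(2 * math.pi * k * (n - half) / num_taps)
        taps.append(acc / num_taps)
    return taps


def convolve(signal, taps):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out


# Hypothetical EQ sample points, denser at low frequencies, with made-up gains.
eq_freqs_hz = [0.0, 250.0, 1000.0, 8000.0]
eq_gains_db = [6.0, 3.0, 0.0, -2.0]
taps = design_linear_phase_fir(eq_freqs_hz, eq_gains_db, 16000, 101)

# Toy stand-in for a simulated impulse response.
simulated_ir = [1.0, 0.5, 0.25, 0.125]
compensated_ir = convolve(simulated_ir, taps)
```

Because the taps are symmetric about the center, the filter delays every frequency by the same number of samples, so it reshapes the magnitude response of the simulated impulse response without distorting its phase. In a fuller pipeline the gains would presumably be drawn from the equalization distribution of recorded impulse responses rather than fixed by hand.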
Fig. 1. Recorded and simulated frequency responses (magnitude) and EQ sample points with logarithmic scale for x-axis. The recorded frequency response has more variation at certain frequency ranges. Our EQ sample points have higher density over low frequencies, where the simulated and recorded responses are most distinct. (Proposed Method)
Fig. 2. Target sub-band EQ distribution (left) and re-sampled distribution (right) with fitted GMM. There is a good match between the two distributions, meaning that our GMM has successfully captured the variation of the target distribution. (Equalization Matching)
Fig. 3. Spectrograms of a simulated IR, the simulated IR after EQ matching, and an example recorded IR. The simulated IR after EQ matching has energy distribution over low and high frequencies (due to room EQ) more similar to a recorded IR, despite the difference between their reverberation times. (Compensation Filters)