Recently hyped ML content linked in one simple page
Sources: reddit/r/{MachineLearning,datasets}, arxiv-sanity, twitter, kaggle/kernels, hackernews, awesome-datasets, sota changes
Made by: Deep Phrase HK Limited
[1911.04890v1] Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
We illustrated that our AV system slightly improves over an audio-only system when trained on the same amount of training data, but leads to a significant performance improvement in the presence of babble noise or overlapping speech
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set. Published at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), December 14-18, 2019, Sentosa, Singapore.
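The abstract and Fig. 3 describe an RNN-T whose encoder consumes both acoustic and visual features, with video and audio switches that zero out one modality to obtain audio-only or visual-only systems from the same architecture. Below is a minimal sketch of that idea in PyTorch; the layer types, sizes, vocabulary, and fusion by concatenation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code) of an RNN-T style
# audio-visual model: per-modality encoders, modality switches, a prediction
# network over previous labels, and a joint network over the (T, U) lattice.
import torch
import torch.nn as nn

class AudioVisualRNNT(nn.Module):
    def __init__(self, audio_dim=240, video_dim=512, hidden=640, vocab=4096):
        super().__init__()
        # Separate encoders for each modality (LSTM stacks assumed here).
        self.audio_enc = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        self.video_enc = nn.LSTM(video_dim, hidden, num_layers=2, batch_first=True)
        # Prediction network conditioned on previously emitted labels.
        self.embed = nn.Embedding(vocab, hidden)
        self.pred_net = nn.LSTM(hidden, hidden, batch_first=True)
        # Joint network combines encoder and prediction outputs.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden + hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab),
        )

    def forward(self, audio, video, labels, use_audio=True, use_video=True):
        # audio: (B, T, audio_dim), video: (B, T, video_dim), frame-synchronized.
        a, _ = self.audio_enc(audio)
        v, _ = self.video_enc(video)
        # The "switches": zeroing one modality yields an audio-only or
        # video-only system with the same architecture.
        if not use_audio:
            a = torch.zeros_like(a)
        if not use_video:
            v = torch.zeros_like(v)
        enc = torch.cat([a, v], dim=-1)              # (B, T, 2*hidden)
        pred, _ = self.pred_net(self.embed(labels))  # (B, U, hidden)
        # Broadcast both streams onto the (T, U) joint lattice.
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)],
            dim=-1)
        return self.joint(joint_in)                  # (B, T, U, vocab)
```

In practice the (T, U, vocab) joint output would be trained with an RNN-T loss over the lattice (with a blank symbol added to the output vocabulary), for example torchaudio.functional.rnnt_loss.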
Fig. 1. Unsynchronized audio and visual frames. Stacked mel-spaced filterbank features occur at a 30 ms frame rate; video thumbnails occur at a 33-40 ms frame rate (25-30 frames per second).
Fig. 2. Synchronized audio and visual frames. Stacked mel-spaced filterbank features occur at a variable frame rate matching the video rate; video thumbnails occur at a 33-42 ms frame rate (30-24 frames per second). The audio STFT analysis window remains 25 ms, but the shift is variable.
Fig. 3. RNN-T model architecture. The video and audio switches make it possible to isolate either modality.
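Fig. 2 describes keeping the 25 ms STFT analysis window fixed while making the frame shift variable, so that one audio feature frame is produced per video frame. Here is a minimal sketch of that framing step, assuming 16 kHz audio and known video frame timestamps; the function name and defaults are hypothetical.

```python
# Sketch (assumptions, not the paper's pipeline): cut one fixed 25 ms analysis
# window per video frame timestamp, so the audio frame rate follows the
# (variable) video rate. Filterbank computation would then run per window.
import numpy as np

def frames_at_video_rate(audio, sample_rate, video_timestamps, win_ms=25.0):
    """Extract one fixed-length analysis window per video frame timestamp (s)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    frames = []
    for t in video_timestamps:
        start = int(round(t * sample_rate))
        frame = audio[start:start + win]
        if len(frame) < win:                      # zero-pad at the tail
            frame = np.pad(frame, (0, win - len(frame)))
        frames.append(frame)
    return np.stack(frames)                       # (num_video_frames, win)

# Example: 30 fps video over 16 kHz audio -> one 400-sample window every ~33 ms.
sr = 16000
audio = np.random.randn(sr * 2)                   # 2 s of dummy audio
video_ts = np.arange(0, 2.0, 1.0 / 30.0)          # 30 fps timestamps
print(frames_at_video_rate(audio, sr, video_ts).shape)   # (60, 400)
```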
Related (TF-IDF):
[1807.05162] Large-Scale Visual Speech Recognition
[1809.02108] Deep Audio-Visual Speech Recognition
[1611.05358] Lip Reading Sentences in the Wild
[1903.00216] KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos
[1808.05312] Toward domain-invariant speech recognition via large scale training