[1804.06323] When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?
Our conclusions yield practical recommendations for when and why pre-trained embeddings are effective in NMT, particularly in low-resource scenarios: (1) there is a sweet spot where word embeddings are most effective, with very little training data but not so little that the system cannot be trained at all; (2) pre-trained embeddings seem to be more effective for more similar language pairs; (3) a priori alignment of embeddings may not be necessary in bilingual scenarios, but is helpful in multilingual training scenarios (a sketch of such alignment follows below).
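Finding (3) refers to aligning separately trained monolingual embedding spaces before NMT training. As a hedged illustration, the sketch below solves the orthogonal Procrustes problem over a small seed dictionary, the standard closed-form alignment used in offline bilingual embedding work; the matrices `X_seed`, `Y_seed`, the dictionary size, and the random placeholder data are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch of a priori embedding alignment via orthogonal Procrustes:
# learn an orthogonal map W that takes source-language vectors into the
# target-language space, using a small seed dictionary of word pairs.
import numpy as np

def align_embeddings(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||X W - Y||_F.

    X: (n, d) source-language vectors for n seed-dictionary pairs
    Y: (n, d) target-language vectors for the same pairs
    """
    # SVD of the cross-covariance gives the closed-form solution W = U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage with random placeholder data standing in for real seed-pair vectors.
rng = np.random.default_rng(0)
X_seed = rng.normal(size=(500, 300))
Y_seed = rng.normal(size=(500, 300))
W = align_embeddings(X_seed, Y_seed)

# Map the full source vocabulary into the target space.
source_vocab = rng.normal(size=(10000, 300))
aligned = source_vocab @ W  # now directly comparable to target-space vectors
```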
Abstract: The performance of Neural Machine Translation (NMT) systems often suffers in low-resource scenarios where sufficiently large-scale parallel corpora cannot be obtained. Pre-trained word embeddings have proven to be invaluable for improving performance in natural language analysis tasks, which often suffer from a paucity of data. However, their utility for NMT has not been extensively explored. In this work, we perform five sets of experiments that analyze when we can expect pre-trained word embeddings to help in NMT tasks. We show that such embeddings can be surprisingly effective in some cases, providing gains of up to 20 BLEU points in the most favorable setting.
[Figure 1: BLEU and BLEU gain by data size. (Q2: Effect of Training Data Size)]
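The basic intervention the abstract studies is replacing random initialization of an NMT model's embedding layer with pre-trained vectors. Below is a minimal PyTorch sketch of that setup, assuming a hypothetical toy vocabulary and a fastText-style `.vec` input; it is an illustration under those assumptions, not the authors' actual pipeline.

```python
# Sketch: initialize an NMT embedding layer from pre-trained word vectors,
# falling back to random initialization for words without a pre-trained vector.
import torch
import torch.nn as nn

def build_embedding_matrix(lines, vocab, dim):
    """Build a (|V|, dim) matrix, copying pre-trained vectors where available.

    lines: iterable of "word v1 v2 ... v_dim" strings (fastText .vec style)
    vocab: dict mapping word -> row index
    """
    weight = torch.randn(len(vocab), dim) * 0.1  # random fallback for OOV words
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in vocab and len(values) == dim:  # also skips .vec header lines
            weight[vocab[word]] = torch.tensor([float(v) for v in values])
    return weight

# Toy, self-contained usage; with real data you would stream a vector file,
# e.g. build_embedding_matrix(open("cc.en.300.vec", encoding="utf-8"), vocab, 300).
vocab = {"<unk>": 0, "the": 1, "cat": 2}
toy_vectors = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
weight = build_embedding_matrix(toy_vectors, vocab, dim=3)
embedding = nn.Embedding.from_pretrained(weight, freeze=False)  # fine-tune during training
```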