[1912.11959v1] Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention
Abstract: The key to a Transformer model is the self-attention mechanism, which allows the model to analyze an entire sequence in a computationally efficient manner. Recent work has suggested the possibility that general attention mechanisms used by RNNs could be replaced by active-memory mechanisms. In this work, we evaluate whether various active-memory mechanisms could replace self-attention in a Transformer. Our experiments suggest that active memory alone achieves comparable results to the self-attention mechanism for language modelling, but optimal results are mostly achieved by using both active-memory and self-attention mechanisms together. We also note that, for some specific algorithmic tasks, active-memory mechanisms alone outperform both self-attention and a combination of the two. We also analyze the empirical differences between the studied algorithmic tasks, and investigate why the self-attention mechanism may assist one task but harm another.
Figure 1: The active-memory mechanism. In this case, active memory is implemented in a unidirectional manner, with a kernel size of 3.
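The unidirectional variant described in Figure 1 can be illustrated with a minimal sketch: a causal 1-D convolution of kernel size 3, where the output at step t is computed only from inputs at steps ≤ t (achieved by left-padding with zeros). This is a hypothetical illustration of the general idea, not the paper's actual implementation; the function name `causal_conv1d` and the example weights are assumptions for demonstration.

```python
def causal_conv1d(x, w, bias=0.0):
    """Unidirectional (causal) 1-D convolution sketch.

    Output at position t depends only on x[t - k + 1 .. t], where
    k = len(w): the sequence is left-padded with k - 1 zeros so no
    future position leaks into the output. (Strictly this computes a
    cross-correlation, the usual convention in deep-learning layers.)
    """
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)  # left padding enforces causality
    return [sum(padded[t + i] * w[i] for i in range(k)) + bias
            for t in range(len(x))]

# Kernel size 3, as in Figure 1.
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.25, 0.25]
y = causal_conv1d(x, w)
```

Because of the left padding, editing a later input (e.g. `x[3]`) leaves all earlier outputs unchanged, which is the property that makes this mechanism usable in an autoregressive (language-modelling) setting.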