[1702.00887] Structured Attention Networks
It should be noted that the additional complexity in computing the attention distribution increases run-time; for example, structured attention was approximately 5× slower to train than simple attention for the neural machine translation experiments, even though both attention layers have the same asymptotic run-time (i.e. …)

Abstract: Attention networks have proven to be an effective approach for embedding
categorical inference within a deep neural network. However, for many tasks we
may want to model richer structural dependencies without abandoning end-to-end
training. In this work, we experiment with incorporating richer structural
distributions, encoded using graphical models, within deep networks. We show
that these structured attention networks are simple extensions of the basic
attention procedure, and that they allow for extending attention beyond the
standard soft-selection approach, such as attending to partial segmentations or
to subtrees. We experiment with two different classes of structured attention
networks: a linear-chain conditional random field and a graph-based parsing
model, and describe how these models can be practically implemented as neural
network layers. Experiments show that this approach is effective for
incorporating structural biases, and structured attention networks outperform
baseline attention models on a variety of synthetic and real tasks: tree
transduction, neural machine translation, question answering, and natural
language inference. We further find that models trained in this way learn
interesting unsupervised hidden representations that generalize simple
attention.
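The "standard soft-selection" baseline the abstract refers to treats attention as the expectation of a categorical latent variable over memory positions. A minimal NumPy sketch of that baseline (the dot-product scoring used here is an assumption for illustration; the paper's potentials come from task-specific networks):

```python
import numpy as np

def soft_attention(memory, query):
    """Soft-selection attention: a categorical latent variable z picks one
    memory slot; the context vector is the expectation E[memory[z]].

    memory: (n, d) array of encoded inputs x; query: (d,) vector q.
    Dot-product scoring is an illustrative assumption.
    """
    scores = memory @ query                    # (n,) unnormalized potentials
    scores = scores - scores.max()             # stabilize the softmax
    p = np.exp(scores) / np.exp(scores).sum()  # attention distribution p(z | x, q)
    context = p @ memory                       # sum_i p_i * memory_i
    return p, context

rng = np.random.default_rng(0)
memory = rng.standard_normal((5, 4))
query = rng.standard_normal(4)
p, context = soft_attention(memory, query)
```

Structured attention replaces the single categorical `z` with a collection of latent variables whose joint distribution is given by a graphical model, while the context remains an expectation under that distribution.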

Figure 4: Three versions of a latent variable attention model: (a) a standard soft-selection attention network, (b) a Bernoulli (sigmoid) attention network, (c) a linear-chain structured attention model for segmentation. The input and query are denoted x and q respectively. (Structured Attention)

Figure 5: Algorithms for the linear-chain CRF: (left) computation of the forward-backward tables α, β and the marginal probabilities p from the potentials θ (the forward-backward algorithm); (right) backpropagation of the loss gradient with respect to the marginals, ∇_p L. C denotes the state space and ⟨t⟩ is the special start/stop state. Backpropagation uses the identity ∇_{log p} L = p ⊙ ∇_p L to calculate ∇_θ L = ∇_{log p} L ∇_θ log p, where ⊙ is element-wise multiplication. Typically the forward-backward algorithm with marginals is performed in the log-space semifield R ∪ {±∞} with binary operations ⊕ = logadd and ⊗ = + for numerical precision. However, backpropagation requires working with the logs of negative values (since ∇_p L can be negative), so we extend to a field [R ∪ {±∞}] × {+, −} with special +/− log-space operations. Binary operations applied to vectors are element-wise. The signexp function is defined as signexp(l_a) = s_a exp(l_a). See Section ?? and Table ?? for more details. (End-to-End Training)
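The left column of Figure 5 (forward-backward over the linear chain, run in log-space) can be sketched as follows. This is a simplified layout that folds unary scores into the pairwise potentials and omits the ⟨t⟩ start/stop state, so the shapes and names here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along an axis (⊕ = logadd)."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def crf_marginals(theta):
    """Forward-backward marginals for a linear-chain CRF, in log-space.

    theta: (n-1, C, C) array; theta[i, a, b] is the log-potential of
    z_i = a and z_{i+1} = b (unaries folded in -- an assumption).
    Returns pairwise marginals p[i, a, b] = p(z_i = a, z_{i+1} = b | x, q).
    """
    n1, C, _ = theta.shape
    alpha = np.zeros((n1 + 1, C))  # forward log-messages
    beta = np.zeros((n1 + 1, C))   # backward log-messages
    for i in range(n1):
        alpha[i + 1] = logsumexp(alpha[i][:, None] + theta[i], axis=0)
    for i in range(n1 - 1, -1, -1):
        beta[i] = logsumexp(theta[i] + beta[i + 1][None, :], axis=1)
    logZ = logsumexp(alpha[n1], axis=0)  # log partition function
    # Pairwise marginal: alpha_i(a) ⊗ theta_i(a,b) ⊗ beta_{i+1}(b), normalized
    logp = alpha[:-1, :, None] + theta + beta[1:, None, :] - logZ
    return np.exp(logp)

rng = np.random.default_rng(1)
p = crf_marginals(rng.standard_normal((3, 4, 4)))
```

In the structured attention layer these marginals play the role of the attention weights, and the gradient step in the right column of Figure 5 backpropagates through this same recursion.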

Figure 6: Visualization of the source self-attention distribution for the simple (left) and structured (right) attention models on the tree transduction task. $ is the special root symbol. Each row delineates the distribution over the parents (i.e. each row sums to one). The attention distribution obtained from the parsing marginals is better able to capture the tree structure; e.g. the attention weights of closing parentheses are generally placed on the opening parentheses (though not necessarily on a single parenthesis). (Tree Transduction)

Figure 7: Visualization of the source attention distribution for the simple (top left), sigmoid (top right), and structured (bottom left) attention models over the ground truth sentence on the character-to-word translation task. Manually annotated alignments are shown at bottom right. Each row delineates the attention weights over the source sentence at each step of decoding. The sigmoid/structured attention models are able to learn an implicit segmentation model and focus on multiple characters at each time step. (Neural Machine Translation)

Figure 8: Visualization of the attention distribution over supporting fact sequences for an example question in task 16 for the Binary CRF model. The actual question is displayed at the bottom along with the correct answer and the ground truth supporting facts (5 → 6 → 8). The edges represent the marginal probabilities p(z_k, z_{k+1} | x, q), and the nodes represent the n supporting facts (here we have n = 9). The text for the supporting facts is shown on the left. The top three most likely sequences are: p(z_1 = 5, z_2 = 6, z_3 = 8 | x, q) = 0.0564, p(z_1 = 5, z_2 = 6, z_3 = 3 | x, q) = 0.0364, p(z_1 = 5, z_2 = 2, z_3 = 3 | x, q) = 0.0356. (Question Answering)

Figure 9: Forward step of the syntactic attention layer to compute the marginals, using the inside-outside algorithm (Baker, 1979) on the data structures of Eisner (1996). We assume the special root symbol is the first element of the sequence, and that the sentence length is n. Calculations are performed in the log-space semifield with ⊕ = logadd and ⊗ = + for numerical precision. a, b ← c means a ← c and b ← c. a ←⊕ b means a ← a ⊕ b. (Forward/Backward through the Inside-Outside Algorithm)

Figure 10: Backpropagation through the inside-outside algorithm to calculate the gradient with respect to the input potentials. ∇_b a denotes the Jacobian of a with respect to b (so ∇_θ L is the gradient with respect to θ). a, b ←⊕ c means a ← a ⊕ c and b ← b ⊕ c. (Forward/Backward through the Inside-Outside Algorithm)
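The log-space operations used throughout Figures 5, 9, and 10 can be sketched directly. The forward passes only need ⊕ = logadd and ⊗ = + on log-magnitudes; backpropagation (Figures 5 and 10) additionally needs the signexp representation, carrying each value as a (log-magnitude, sign) pair so that negative gradients stay representable. The pairing of sign with magnitude here is a plain reading of the caption, not the paper's exact data layout:

```python
import math

def log_add(la, lb):
    """⊕ in the log-space semifield: log(exp(la) + exp(lb))."""
    if la < lb:
        la, lb = lb, la
    if lb == float("-inf"):
        return la
    return la + math.log1p(math.exp(lb - la))

# ⊗ is ordinary addition of log-values: log(exp(la) * exp(lb)) = la + lb.

def signed_log_add(la, sa, lb, sb):
    """⊕ extended to the field [R ∪ {±∞}] × {+, −}.

    Each value a is carried as (la, sa) with a = signexp(la) = sa * exp(la)
    and sa ∈ {+1, -1}, so logs of negative gradients are representable.
    """
    if la < lb:               # keep the larger magnitude first
        la, sa, lb, sb = lb, sb, la, sa
    if lb == float("-inf"):
        return la, sa
    if sa == sb:              # same sign: magnitudes add
        return la + math.log1p(math.exp(lb - la)), sa
    diff = 1.0 - math.exp(lb - la)
    if diff == 0.0:           # exact cancellation: a + b = 0
        return float("-inf"), 1
    return la + math.log(diff), sa  # opposite signs: magnitudes subtract
```

For example, 3 + (-2) in this representation is `signed_log_add(log 3, +1, log 2, -1)`, which returns (log 1, +1) = (0.0, +1).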