[1710.09599] Watch Your Step: Learning Node Embeddings via Graph Attention
We believe that our contribution in replacing these sampling hyperparameters with a learnable context distribution is general and can be applied to many domains and modeling techniques in graph representation learning.

Abstract: Graph embedding methods represent nodes in a continuous vector space,
preserving information from the graph (e.g. by sampling random walks). These
methods have many hyper-parameters (such as the random walk length) that must
be manually tuned for every graph. In this paper, we replace the random walk
hyper-parameters with trainable parameters that we learn automatically via
backpropagation. In particular, we learn a novel attention model on the power
series of the transition matrix, which guides the random walk to optimize an
upstream objective. Unlike previous attention models, ours applies attention
parameters exclusively to the data (e.g. to the random walk); they are not used
by the model for inference. We experiment on link prediction tasks, as we aim
to produce embeddings that best preserve the graph structure, generalizing to
unseen information. We improve the state of the art on a comprehensive suite of
real-world datasets including social, collaboration, and biological networks.
Adding attention to random walks reduces error by 20% to 45% on the datasets we
tried. Further, the learned attention parameters differ for every graph, and
the automatically-found values agree with the optimal hyper-parameter choice
obtained by manually tuning existing methods.
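The core idea in the abstract — replacing a hand-tuned context distribution with trainable attention over powers of the transition matrix — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the choice of plain NumPy (no backpropagation shown) are assumptions for clarity.

```python
import numpy as np

def attention_walk_context(adj, q_logits):
    """Combine powers of the transition matrix with softmax attention.

    adj:      (n, n) adjacency matrix of the graph.
    q_logits: (C,) trainable logits; softmax(q_logits) plays the role of
              the context distribution that node2vec fixes by hand.
    Returns E = sum_{k=1..C} softmax(q)_k * T^k, the attention-weighted
    expected context matrix (each row is a distribution over context nodes).
    """
    deg = adj.sum(axis=1, keepdims=True)
    T = adj / np.maximum(deg, 1)               # row-stochastic transition matrix
    q = np.exp(q_logits - q_logits.max())
    q = q / q.sum()                            # attention over walk lengths 1..C
    E = np.zeros_like(T)
    Tk = np.eye(len(adj))
    for qk in q:
        Tk = Tk @ T                            # running power T^k
        E += qk * Tk
    return E

# Tiny example: a 3-node path graph with uniform attention over T^1..T^3.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
E = attention_walk_context(A, np.zeros(3))
```

In the paper's framing, `q_logits` would be updated by backpropagation through the embedding objective, so each graph learns its own effective context window rather than inheriting a fixed hyper-parameter.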

Figure 1: Datasets used in our experiments.
Figure 2: Test ROC-AUC as a function of C using node2vec.
Figure 3: Figure 1 presents statistics of our datasets. Figure 2 motivates our work by showing the necessity of setting the parameter C for node2vec (d=128; each point is the average of 7 runs).
Figure 4: Learned attention weights Q (log scale).
Figure 5: Q with varying regularization β (linear scale).
Figure 6: (a) shows the learned attention weights Q, which agree with the grid search over node2vec (Figure 2). (b) shows how varying β affects the learned Q. Note that the distributions can quickly tail off to zero (ego-Facebook and PPI), while other graphs (wiki-vote) contain information across distant nodes.
Figure 7: node2vec, Cora.
Figure 8: Graph Attention (ours), Cora.
Figure 9: Classification accuracy.
Figure 10: Node classification. (a)/(b): t-SNE visualizations of node embeddings for the Cora dataset. Both methods are unsupervised; the learned representations are colored by node labels. (c): Quantitatively, our embeddings achieve better separation.
Figure 11: Sensitivity analysis of the softmax attention model. Our method is robust to the choices of both β and C, and consistently outperforms even an optimally set node2vec.