[1911.00792v2] An Algorithm for Routing Capsules in All Domains
Both networks achieve state-of-the-art performance in their respective domains after training with the same regime, thereby showing that adding one or more layers of our routing algorithm can produce state-of-the-art results in more than one domain, without requiring tuning.

Abstract: Building on recent work on capsule networks, we propose a new, general-purpose form of "routing by agreement" that activates output capsules in a layer as a function of their net benefit to use and net cost to ignore input capsules from earlier layers. To illustrate the usefulness of our routing algorithm, we present two capsule networks that apply it in different domains: vision and language. [In both domains, we use the same routing code, available at https://github.com/glassroom/heinsen_routing along with pretrained models and replication instructions.] The first network achieves new state-of-the-art accuracy of 99.1% on the smallNORB visual recognition task with fewer parameters and an order of magnitude less training than previous capsule models, and we find evidence that it learns to perform a form of "reverse graphics." The second network achieves new state-of-the-art accuracies on the root sentences of the Stanford Sentiment Treebank: 58.5% on fine-grained and 95.6% on binary labels with a single-task model that routes frozen embeddings from a pretrained transformer as capsules. In both domains, we train with the same regime.
Figure 1: Test set accuracy and number of parameters of models that have achieved state-of-the-art results on smallNORB visual recognition.

Figure 2: Test set accuracy and publication year of models that have achieved state-of-the-art results on SST-5/R root sentence classification.

Figure 3: Our smallNORB model. (a) We stack each pair of images with (b) coordinate values evenly spaced from -1.0 to 1.0, horizontally and vertically, creating an input tensor of shape 4 × m × n, where m = n = 96 for unmodified images at test time. (c) We apply six 3 × 3 convolutions, each with 64 output channels and alternating strides of 1 and 2. Each convolution is preceded by batch normalization and followed by a Swish activation (Ramachandran et al., 2017) with constant β = 1. The last convolution outputs a tensor of shape 64 × m′ × n′. (d) We compute a(inp) and µ(inp) by applying two 1 × 1 convolutions, with 64 and 1024 output channels, respectively, and reshape them as shown. Both convolutions are preceded by batch normalization. After reshaping, a(inp) consists of 64m′n′ input scores, representing possible presence or absence of 64 toy parts in m′n′ image locations. µ(inp) consists of 64m′n′ slices of shape 4 × 4, each representing a pose for one of 64 parts in m′n′ locations. (e) We apply two layers of our routing algorithm; the first one routes a variable number of input capsules to 64 output capsules, each representing a larger toy part with a 4 × 4 pose; the second one routes those capsules to five capsules, each representing a type of toy with a 4 × 4 pose. For prediction, we apply a Softmax to a(out).

Figure 5: Our SST model. (a) For each sample, the input is a tensor of transformer embeddings of shape n × l × m, where n is the number of tokens, l is the number of transformer layers, and m is the embedding size. (b) We element-wise add to the input tensor a depth-of-layer parameter of shape l × m. (c) We apply an affine transformation that maps m to 64, followed by a Swish activation with constant β = 1 and layer normalization, obtaining a tensor of shape n × l × 64. (d) We reshape the tensor as shown to obtain µ(inp), consisting of nl input capsules of size 1 × 64. We compute a(inp) ← log(x / (1 − x)) from a mask x of length nl, with ones and zeros indicating, respectively, which embeddings correspond to tokens and which correspond to any padding necessary to group samples in batches, obtaining logits that are equal to ∞ for tokens, −∞ for padding, and values in between for any tokens and padding that get combined by mixup regularization in training. (e) We apply two layers of our routing algorithm; the first one routes a variable number of capsules in µ(inp) to 64 capsules of shape 1 × 2; the second one routes those capsules to five or two capsules of equal shape, each representing a classification label in SST-5/R or SST-2/R. For prediction, we apply a Softmax to the output scores a(out).
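The input construction in steps (a)-(b) of the smallNORB model (Figure 3) can be sketched in a few lines of NumPy. This is a minimal illustration only: the function name and the channel ordering (two image channels, then horizontal and vertical coordinate channels) are our own assumptions, not taken from the released code.

```python
import numpy as np

def stack_with_coords(image_pair):
    """Stack a stereo image pair with coordinate channels.

    image_pair: array of shape (2, m, n), one channel per image.
    Returns an array of shape (4, m, n): the two image channels plus
    horizontal and vertical coordinates evenly spaced from -1.0 to 1.0.
    """
    pair = np.asarray(image_pair, dtype=np.float64)
    _, m, n = pair.shape
    ys = np.linspace(-1.0, 1.0, m)   # vertical coordinates
    xs = np.linspace(-1.0, 1.0, n)   # horizontal coordinates
    yy, xx = np.meshgrid(ys, xs, indexing='ij')  # each of shape (m, n)
    return np.concatenate([pair, xx[None], yy[None]], axis=0)

# For unmodified smallNORB images at test time, m = n = 96:
x = stack_with_coords(np.zeros((2, 96, 96)))  # shape (4, 96, 96)
```

Appending explicit coordinate channels gives the subsequent convolutions direct access to image position, which is useful because the routed capsules encode poses at specific locations.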
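The mask-to-logits computation a(inp) ← log(x / (1 − x)) in step (d) of the SST model (Figure 5) can be sketched as follows. This is a minimal NumPy illustration; the function and variable names are ours, not from the released code.

```python
import numpy as np

def mask_to_logits(x):
    """Map a mask x with entries in [0, 1] to scores log(x / (1 - x)).

    Ones (real tokens) map to +inf, zeros (padding) map to -inf, and
    intermediate values (e.g., positions blended by mixup during
    training) map to finite logits.
    """
    x = np.asarray(x, dtype=np.float64)
    with np.errstate(divide='ignore'):     # log(0) = -inf is intended here
        return np.log(x) - np.log1p(-x)    # log(x / (1 - x)), numerically stable

# Example: two real tokens, one mixup-blended position, one padding slot.
a_inp = mask_to_logits([1.0, 1.0, 0.7, 0.0])
```

Writing the logit as log(x) − log1p(−x) avoids forming the ratio x / (1 − x) explicitly, which would overflow before the logarithm is taken for x close to 1.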

Figure 4: Multidimensional scaling (MDS) representations in R² of the trajectories of an activated class capsule's four pose vectors, each of size d(out) = 4, as we feed test images of an object in the class with varying elevations to our trained smallNORB model. For each image, the R² coordinates are plotted as four connected vertices, each vertex corresponding to a pose vector, preserving as much as possible the pairwise distances between pose vectors from all images. Circles indicate the activated capsule's first pose vector for the first and last image.

Figure 6: For each toy category and instance in the test set, we feed our smallNORB model a sequence of images with varying azimuth, while keeping everything else constant, and plot the change in each pose vector of the activated capsule. The mean change is shown in dark blue. (a) The first row of plots shows each pose vector's Euclidean distance to its initial value, divided by the norm of the initial value, as azimuth varies from 0 to 340 degrees. We can see that the pose vector tends to move away from and then back close to its initial value, consistent with rotation. (b) The second row shows the norm of all pose vectors, divided by their initial norms. We can see that pose vector norms tend to stay close to the initial norm, consistent with rotation. (c) The third row shows cosine similarity between pose vectors and their initial value. We can see that the angle tends to increase and then decrease, consistent with rotation.

Figure 7: For each toy category and instance in the test set, we feed our smallNORB model a sequence of images with varying elevation, while keeping everything else constant, and plot the change in each pose vector of the activated capsule. The mean change is shown in dark blue. (a) The first row of plots shows each pose vector's Euclidean distance to its initial value, divided by the norm of the initial value, as elevation varies through nine levels (from near flat to looking from above). We can see that the pose vector moves away from its initial value, consistent with the change in elevation. (b) The second row shows the norm of all pose vectors, divided by their initial norms. We can see that pose vector norms tend to stay close to the initial norm, consistent with rotation due to the change in elevation. (c) The third row shows cosine similarity between each pose vector and its initial value. We can see that the angle tends to increase but not decrease back to its original value, consistent with the change in elevation.

Figure 8: Mean validation loss and accuracy after each epoch of training. Shaded area denotes the standard deviation of batches, with 20 samples each. Note: the smallNORB dataset does not have a validation split.

Figure 9: Sample smallNORB stereographic images of one toy in each of five toy categories. For each toy, we show 18 image-pair samples of varying azimuth while keeping elevation and lighting constant.
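The three per-pose-vector statistics plotted in Figures 6 and 7 (relative distance to the initial value, relative norm, and cosine similarity) can be sketched as follows. This is an illustrative NumPy sketch; the function name and array layout are our own assumptions.

```python
import numpy as np

def trajectory_stats(poses):
    """Statistics of one pose vector's trajectory across a sequence of images.

    poses: array of shape (T, d), one pose vector per image (e.g., d = 4
    for the smallNORB model), with poses[0] the reference at the initial
    azimuth or elevation.

    Returns three arrays of shape (T,):
      rel_distance: Euclidean distance to the initial vector, divided by
                    the initial vector's norm (Figures 6a/7a),
      rel_norm:     each vector's norm divided by the initial norm (6b/7b),
      cos_sim:      cosine similarity with the initial vector (6c/7c).
    """
    poses = np.asarray(poses, dtype=np.float64)
    p0 = poses[0]
    n0 = np.linalg.norm(p0)
    norms = np.linalg.norm(poses, axis=-1)
    rel_distance = np.linalg.norm(poses - p0, axis=-1) / n0
    rel_norm = norms / n0
    cos_sim = (poses @ p0) / (norms * n0)
    return rel_distance, rel_norm, cos_sim

# Sanity check: a pure rotation of a unit vector through a full turn keeps
# rel_norm at 1 while rel_distance moves away from 0 and back, as in Figure 6.
theta = np.linspace(0.0, 2.0 * np.pi, 18, endpoint=False)  # 18 azimuth steps
rot = np.stack([np.cos(theta), np.sin(theta)], axis=-1)    # shape (18, 2)
d, n, c = trajectory_stats(rot)
```

Under a pure rotation, the distance statistic peaks at the half-turn while the norm statistic stays constant, which is exactly the signature the captions describe as "consistent with rotation."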