[1910.08485v1] Understanding Deep Networks via Extremal Perturbations and Smooth Masks
We introduce the framework of extremal perturbation analysis, which avoids some of the issues of prior work that uses perturbations to analyse neural networks.

Abstract. The problem of attribution is concerned with identifying the parts of an input that are responsible for a model’s output. An important family of attribution methods is based on measuring the effect of perturbations applied to the input. In this paper, we discuss some of the shortcomings of existing approaches to perturbation analysis and address them by introducing the concept of extremal perturbations, which are theoretically grounded and interpretable. We also introduce a number of technical innovations to compute extremal perturbations, including a new area constraint and a parametric family of smooth perturbations, which allow us to remove all tunable hyper-parameters from the optimization problem. We analyze the effect of perturbations as a function of their area, demonstrating excellent sensitivity to the spatial properties of the deep neural network under stimulation. We also extend perturbation analysis to the intermediate layers of a network. This application allows us to identify the salient channels necessary for classification, which, when visualized using feature inversion, can be used to elucidate model behavior. Lastly, we introduce TorchRay [github.com/facebookresearch/TorchRay], an interpretability library built on PyTorch.
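The core operation behind perturbation-based attribution is blending the input with an uninformative baseline (here, a blurred copy) under a mask m ∈ [0, 1]: regions with m = 1 are preserved and regions with m = 0 are perturbed, and the masked input is then fed to the model. A minimal stdlib-only sketch of this blending on a toy 1-D "image" (the function name `perturb` and all values are illustrative, not TorchRay's API):

```python
def perturb(x, m, x_blur):
    """Blend original pixels x with a blurred baseline x_blur under a
    mask m in [0, 1]: m=1 keeps a pixel, m=0 replaces it with the blur."""
    return [mi * xi + (1 - mi) * bi for xi, mi, bi in zip(x, m, x_blur)]

# Toy 1-D "image" and a heavily blurred baseline (illustrative values).
x      = [0.9, 0.8, 0.1, 0.2]
x_blur = [0.5, 0.5, 0.5, 0.5]

keep_all   = perturb(x, [1, 1, 1, 1], x_blur)  # preservation: identical to x
delete_all = perturb(x, [0, 0, 0, 0], x_blur)  # deletion: identical to x_blur
partial    = perturb(x, [1, 1, 0, 0], x_blur)  # preserve only the first half
```

Preservation then asks for the smallest mask keeping the class score Φ(m ⊗ x) high, and deletion for the smallest mask that destroys it, as plotted in Figure 3.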
Figure 1: Extremal perturbations are regions of an image that, for a given area (boxed), maximally affect the activation of a certain neuron in a neural network (i.e., the “mousetrap” class score). As the area of the perturbation is increased, the method reveals more of the image, in order of decreasing importance. For clarity, we black out the masked regions; in practice, the network sees blurred regions. (Introduction)

Figure 2: Comparison with other attribution methods. We compare our extremal perturbations (optimal area a∗ in box) to several popular attribution methods: gradient [23], guided backpropagation [26], Grad-CAM [22], and RISE [20]. (Introduction)

Figure 3: Extremal perturbations and monotonic effects. Left: “porcupine” masks computed for several areas a (a in box). Right: Φ(ma ⊗ x) (preservation; blue) and Φ((1 − ma) ⊗ x) (deletion; orange) plotted as a function of a. At a ≈ 15%, the preserved region scores higher than preserving the entire image (green). At a ≈ 20%, perturbing the complementary region scores similarly to fully perturbing the entire image (red). (Method)

Figure 4: Convolution operators for smooth masks. Gaussian smoothing a mask (blue) with the typical convolution operator yields a dampened, smooth mask (green). Our max-convolution operator mitigates this effect while still smoothing (red solid). Our smax operator, which yields more distributed gradients than max, varies between the other two convolution operators (red dotted). (Smooth masks)

Figure 5: Area growth. We show a few examples of masks constrained to match different areas. Although they are learned independently, these visualizations highlight what the network considers most discriminative (e.g., the tusk of the elephant, the head of the dog) and complete. (Smooth masks)

Figure 6: Area growth. Although each mask is learned independently, these plots highlight what the network considers most discriminative and complete. The bar graph visualizes Φ(ma ⊗ x) as a normalized fraction of Φ0 = Φ(x) (and saturates after exceeding Φ0 by 25%). (Smooth masks)

Figure 7: (Smooth masks)

Figure 8: Comparison with [7]. Our extremal perturbations (top) vs. masks from Fong and Vedaldi [7] (bottom). (Qualitative comparison)

Figure 9: Sanity check [2]. Model weights are progressively randomized from fc8 to conv1_1 in VGG16, demonstrating our method’s sensitivity to model weights. (Qualitative comparison)

Figure 10: Attribution at intermediate layers. Left: visualization (??) of the optimal channel attribution mask ma∗, where a∗ = 25 channels, as defined in (??). Right: this plot shows that the class score monotonically increases as the area (the number of channels) increases. (Attribution at intermediate layers)

Figure 11: Per-instance channel attribution visualization. Left: input image overlaid with the channel saliency map (??). Middle: feature inversion of the original activation tensor. Right: feature inversion of the activation tensor perturbed by the optimal channel mask ma∗. By comparing the feature inversions of the unperturbed (middle) and perturbed (right) activations, we can identify the salient features that our method highlights. (Attribution at intermediate layers)

Figure 12: Discovery of salient, class-specific channels. By analyzing m̄c, the average over all ma∗ for class c (see ??), we automatically find salient, class-specific channels like these. First column: channel feature inversions; all others: dataset examples. (Visualizing per-instance channel attribution)
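Figure 4 contrasts ordinary Gaussian smoothing, which dampens a mask's peaks, with a max-convolution that takes the maximum of kernel-weighted neighbors instead of their sum, and a smooth variant (smax) that distributes gradients over more entries. A stdlib-only 1-D sketch, assuming smax is realized as a temperature-scaled, mean-normalized log-sum-exp (one standard smooth-max choice; the paper's exact normalization may differ, and `smax_conv` is a hypothetical helper, not TorchRay's API):

```python
import math

def smax(values, T):
    """Smooth maximum via temperature-scaled log-sum-exp (mean-normalized).
    As T -> 0 this approaches max(values); larger T spreads gradients over
    more entries, interpolating toward an average."""
    m = max(values)  # shift by the max for numerical stability
    return m + T * math.log(
        sum(math.exp((v - m) / T) for v in values) / len(values))

def smax_conv(mask, kernel, T):
    """1-D 'max-convolution' of a mask: at each position take a smooth max
    of kernel-weighted neighbors instead of their weighted sum, so that
    smoothing does not dampen the mask's peaks (cf. Figure 4)."""
    r = len(kernel) // 2
    out = []
    for i in range(len(mask)):
        vals = [kernel[j + r] * mask[i + j]
                for j in range(-r, r + 1) if 0 <= i + j < len(mask)]
        out.append(smax(vals, T))
    return out

# A spiky binary mask smoothed with a small triangular kernel (toy values):
# the central peak stays near 1 instead of being averaged down.
smoothed = smax_conv([0.0, 0.0, 1.0, 0.0, 0.0], [0.5, 1.0, 0.5], T=0.01)
```

With an ordinary (sum) convolution the same kernel would dampen and widen the peak; the smooth max keeps the peak height while still spreading mass to neighbors.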