[1911.09070v1] EfficientDet: Scalable and Efficient Object Detection
Abstract: Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion. Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width of the backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations, we have developed a new family of object detectors, called EfficientDet, which consistently achieves an order-of-magnitude better efficiency than prior art across a wide spectrum of resource constraints. In particular, without bells and whistles, our EfficientDet-D7 achieves state-of-the-art 51.0 mAP on the COCO dataset with 52M parameters and 326B FLOPs [Similar to [9, 31], FLOPs denotes the number of multiply-adds.], being 4x smaller and using 9.3x fewer FLOPs while still being more accurate (+0.3% mAP) than the best previous detector.
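The compound scaling idea can be sketched as follows: a single coefficient phi jointly grows the BiFPN width and depth, the prediction-head depth, and the input resolution. The specific constants below (base width 64, 1.35 geometric growth, base resolution 512) are assumptions for illustration; the paper's method section defines the exact scaling equations and configuration table.

```python
def efficientdet_scaling(phi):
    """Sketch of compound scaling with coefficient phi.

    Constants are illustrative assumptions, not an exact reproduction
    of the paper's configuration table.
    """
    w_bifpn = int(64 * (1.35 ** phi))   # BiFPN channel width grows geometrically
    d_bifpn = 3 + phi                   # number of repeated BiFPN layers grows linearly
    d_head = 3 + phi // 3               # box/class prediction net depth grows slowly
    r_input = 512 + 128 * phi           # input image resolution grows linearly
    return {"width": w_bifpn, "bifpn_layers": d_bifpn,
            "head_layers": d_head, "resolution": r_input}
```

All four dimensions scale together from one knob, which is what lets the family cover a wide spectrum of resource constraints without tuning each dimension independently.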
Figure 1: Model FLOPs vs COCO accuracy – All numbers are for single-model single-scale. Our EfficientDet achieves much better accuracy with fewer computations than other detectors. In particular, EfficientDet-D7 achieves a new state-of-the-art 51.0% COCO mAP with 4x fewer parameters and 9.3x fewer FLOPs. Details are in Table ??. (Introduction)

Figure 2: Feature network design – (a) FPN [16] introduces a top-down pathway to fuse multi-scale features from level 3 to 7 (P3–P7); (b) PANet [19] adds an additional bottom-up pathway on top of FPN; (c) NAS-FPN [5] uses neural architecture search to find an irregular feature network topology; (d)-(f) are three alternatives studied in this paper. (d) adds expensive connections from all input features to output features; (e) simplifies PANet by removing nodes that have only one input edge; (f) is our BiFPN, with better accuracy and efficiency trade-offs. (BiFPN)

Figure 3: EfficientDet architecture – It employs EfficientNet [31] as the backbone network, BiFPN as the feature network, and a shared class/box prediction network. Both the BiFPN layers and the class/box net layers are repeated multiple times based on different resource constraints, as shown in Table ??. (Weighted Feature Fusion)

Figure 7: Model size and inference latency comparison (panels: Figure 4: Model Size; Figure 5: GPU Latency; Figure 6: CPU Latency) – Latency is measured with batch size 1 on the same machine equipped with a Titan V GPU and Xeon CPU. AN denotes AmoebaNet + NAS-FPN trained with auto-augmentation [37]. Our EfficientDet models are 4x–6.6x smaller, 2.3x–3.2x faster on GPU, and 5.2x–8.1x faster on CPU than other detectors. (Experiments)

Figure 11: Softmax vs. fast normalized feature fusion (panels: Figure 8: Example Node 1; Figure 9: Example Node 2; Figure 10: Example Node 3) – (a)-(c) show normalized weights (i.e., importance) during training for three representative nodes; each node has two inputs (input1 & input2), and their normalized weights always sum up to 1. (Ablation Study)

Figure 12: Comparison of different scaling methods – All methods improve accuracy, but our compound scaling method achieves better accuracy and efficiency trade-offs. (Compound Scaling)
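The weighted fusion behind the softmax-vs-fast-normalized comparison can be sketched as follows. Each fusion node combines its input features with learnable scalar weights that are kept non-negative (via ReLU) and normalized by their sum plus a small epsilon, so the normalized weights behave like the importance values plotted above and approximately sum to 1. This is a minimal NumPy sketch, not the paper's implementation.

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse feature maps as O = sum_i (w_i / (eps + sum_j w_j)) * I_i.

    ReLU keeps each weight non-negative; the epsilon in the denominator
    avoids division by zero, so no softmax is needed.
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU on the raw weights
    norm = w / (eps + w.sum())  # normalized weights lie in [0, 1] and nearly sum to 1
    return sum(n * f for n, f in zip(norm, features))
```

With equal weights the result is close to a plain average of the inputs; the design choice over softmax is that normalizing by the sum is cheaper to compute while giving a similarly bounded weighting.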