[1910.04985v1] VarGFaceNet: An Efficient Variable Group Convolutional Neural Network for Lightweight Face Recognition
To improve the interpretation ability of the lightweight network, we employ an equivalence of angular distillation loss as our objective function and present a recursive knowledge distillation strategy.
Abstract To improve the discriminative and generalization ability of lightweight networks for face recognition, we propose an efficient variable group convolutional network called VarGFaceNet. Variable group convolution was introduced by VarGNet to resolve the conflict between small computational cost and the imbalance of computational intensity inside a block. We employ variable group convolution to design a network that supports large-scale face identification while reducing computational cost and parameters. Specifically, we use a head setting to reserve essential information at the start of the network and propose a particular embedding setting to reduce the parameters of the fully-connected embedding layer. To enhance interpretation ability, we employ an equivalence of angular distillation loss to guide our lightweight network, and we apply recursive knowledge distillation to relieve the discrepancy between the teacher model and the student model. Winning the deepglint-light track of the LFR (2019) challenge demonstrates the effectiveness of our model and approach. The implementation of VarGFaceNet will be released at https://github.com/zma-c137/VarGFaceNet soon.
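The core building operation is variable group convolution: instead of fixing the number of groups as in ordinary group convolution, the number of channels per group is held constant, so the group count scales with the channel width and the computational intensity of each group stays balanced across blocks. Below is a minimal PyTorch sketch of this idea, assuming a fixed group size of 8 channels; `VarGConv` and `channels_per_group` are illustrative names and the actual group size is a hyperparameter of the paper, so this is not the authors' implementation:

```python
import torch.nn as nn


class VarGConv(nn.Module):
    """Variable group convolution (sketch): the number of channels per group
    is a fixed constant, so the number of groups varies with the input width.
    This keeps the per-group computational intensity constant across blocks."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1,
                 channels_per_group=8):
        super().__init__()
        assert in_channels % channels_per_group == 0
        groups = in_channels // channels_per_group  # group count varies, group size is fixed
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=kernel_size // 2,
                              groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.PReLU(out_channels)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```

For example, `VarGConv(64, 128)` uses 8 groups while `VarGConv(256, 256)` uses 32, keeping 8 channels per group in both cases.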
Figure 1. Settings of VarGFaceNet. a) The normal block of VarGFaceNet; we add an SE block to the normal block of VarGNet. b) The down-sampling block. c) The head setting of VarGFaceNet; we do not downsample in the first conv in order to keep enough information. d) The embedding setting of VarGFaceNet; we first expand the channels from 320 to 1024, then employ variable group convolution and pointwise convolution to reduce the parameters and computational cost while retaining essential information.

Figure 2. The process of recursive knowledge distillation. We use the first generation of the student to initialize the second generation, while the teacher model remains the same. Angular distillation loss and ArcFace loss are used to guide training.
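One common formulation consistent with the abstract's description of an "equivalence of angular distillation loss" is the squared L2 distance between L2-normalized teacher and student embeddings, which depends only on the angle between the two feature vectors. The PyTorch sketch below assumes that form rather than reproducing the paper's exact objective; the combination with ArcFace and the two-generation loop only paraphrase the caption of Figure 2:

```python
import torch.nn.functional as F


def angular_distillation_loss(student_emb, teacher_emb):
    """Squared L2 distance between unit-normalized embeddings (assumed form).
    On the unit sphere this equals 2 - 2*cos(theta), so minimizing it aligns
    the student's feature direction with the teacher's."""
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb.detach(), dim=1)  # teacher features are fixed targets
    return (s - t).pow(2).sum(dim=1).mean()


# Recursive knowledge distillation, in outline (per Figure 2):
# generation 1: train the student with arcface_loss + angular_distillation_loss
#               against the fixed teacher;
# generation 2: initialize a new student from generation 1's weights, keep the
#               same teacher, and train again with the same combined objective.
```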