计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2021, Vol. 15, Issue (10): 1812-1829. DOI: 10.3778/j.issn.1673-9418.2104022

• Survey · Exploration •

Review of Knowledge Distillation in Convolutional Neural Network Compression

MENG Xianfa, LIU Fang, LI Guang, HUANG Mengmeng   

  1. National Key Laboratory of Science and Technology on Automatic Target Recognition, National University of Defense Technology, Changsha 410000, China
  • Online: 2021-10-01  Published: 2021-09-30

Abstract:

In recent years, convolutional neural networks (CNN) have achieved remarkable results in many image analysis applications, owing to their powerful ability to extract and represent features. However, these continuous performance gains have come almost entirely from ever deeper and larger network models, so deploying a complete CNN often requires a huge memory footprint and the support of high-performance computing units (such as GPUs). This limits the wide application of CNN on embedded devices with constrained computing resources and on mobile terminals with strict real-time requirements, making network lightweighting an urgent need. The main network compression and acceleration techniques addressing this problem are knowledge distillation, network pruning, parameter quantization, low-rank decomposition, and lightweight network design. This paper first introduces the basic structure and development of convolutional neural networks, and briefly describes and compares five typical network compression methods. It then reviews and summarizes knowledge distillation methods in detail, and compares different methods experimentally on the CIFAR datasets. Furthermore, it introduces the current evaluation system for knowledge distillation methods and gives a comparative analysis and evaluation of the various types of methods. Finally, it offers preliminary thoughts on future research directions for this technique.
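For readers unfamiliar with the basic technique, the classic distillation objective that the surveyed methods build on (Hinton et al., 2015) can be sketched in a few lines of PyTorch. The following is an illustrative sketch only, not code from the paper; the names kd_loss, temperature, and alpha are hypothetical.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    # Soften both output distributions with the temperature T.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # The KL term is scaled by T^2 so its gradients stay comparable to the CE term.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Example: a random CIFAR-10-style batch with 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(student_logits, teacher_logits, labels))

Here alpha weights the softened teacher distribution against the hard labels; the distillation variants surveyed in the paper differ mainly in what knowledge is transferred and how this basic objective is extended.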

Key words: convolutional neural network (CNN), knowledge distillation, neural network compression, lightweight network