Multi-teacher Contrastive Knowledge Inversion for Data-Free Distillation

doi:10.3778/j.issn.1673-9418.2204107

Abstract

Knowledge distillation is an effective method for compressing deep neural networks, but it normally requires access to the original training data. In practice, the original data are often unavailable because of user privacy, data confidentiality, or transmission constraints. Existing data-free knowledge distillation methods rely on the biased feature statistics of a single teacher model, so the synthesized data lack diversity and generalize poorly compared with the original data, which lowers the accuracy of the compressed model. To address these problems, this paper proposes a multi-teacher contrastive knowledge inversion (MTCKI) method for data-free model compression. MTCKI extracts knowledge from multiple available teacher models and fuses it into a student model, which removes the bias introduced by single-model statistics and improves the generalizability of the synthesized images. To increase image diversity, a contrastive learning strategy compares the images generated in the current batch with previously stored images, forcing the generator to synthesize images that differ from the historical ones. In addition, a multi-teacher-student contrastive strategy is proposed to further improve the feature representation ability of the student network. Experiments show that MTCKI not only generates visually satisfactory images but also outperforms existing methods on multiple metrics. The synthesized images are closer to the distribution of the original dataset, and a dataset generated only once can be reused to train different models rather than a specific one.

Key words: model compression, data-free, knowledge distillation, data protection, privacy protection
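
The abstract names two ingredients, fusing knowledge from several teachers into one student and contrasting new samples against stored ones, without giving their formulas. The sketch below is only a generic illustration of those two ideas and is not the authors' implementation. It assumes uniformly averaged, temperature-softened teacher predictions as the distillation target and a standard InfoNCE loss for the contrastive term. All function names, the temperature, and `tau` are hypothetical choices made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_targets(teacher_logits, temperature=1.0):
    """Average the softened predictions of several teachers.

    Uniform weighting is an assumption of this sketch.
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: small when the anchor is close to the positive
    and far from every negative (here, stored historical embeddings)."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical logits from two teachers and one student for a single image.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
student = [1.0, 0.2, -0.3]
target = ensemble_targets(teachers, temperature=4.0)
kd_loss = kl_divergence(target, softmax(student, temperature=4.0))
```

In this sketch, minimizing `kd_loss` pulls the student toward the averaged teacher prediction, while minimizing `info_nce` with historical embeddings as negatives penalizes a new sample whose embedding resembles stored ones.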

LIN Zhenyuan, LIN Shaohui, YAO Yiwu, HE Gaoqi, WANG Changbo, MA Lizhuang. Multi-teacher Contrastive Knowledge Inversion for Data-Free Distillation[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(11): 2721-2733.
