计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2023, Vol. 17, Issue (11): 2689-2702. DOI: 10.3778/j.issn.1673-9418.2208032

• Graphics · Image •

HSKDLR: Lightweight Lip Reading Method Based on Homogeneous Self-Knowledge Distillation

MA Jinlin, LIU Yuhao, MA Ziping, GONG Yuanwen, ZHU Yanbin   

1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
    2. Key Laboratory for Intelligent Processing of Computer Images and Graphics of National Ethnic Affairs Commission of the PRC, Yinchuan 750021, China
    3. School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China
• Online: 2023-11-01  Published: 2023-11-01

Abstract: To address the low recognition rate and heavy computational cost of lip reading models, this paper proposes a lightweight lip reading model based on homogeneous self-knowledge distillation, named HSKDLR. Firstly, an S-SE (spatial SE) attention module is designed to attend to the spatial features of the lip image; it is used to build the i-Ghost Bottleneck (improved Ghost Bottleneck) module, which extracts both the channel and spatial features of the lip image and thereby improves the accuracy of the lip reading model. Secondly, a lip reading model is built on i-Ghost Bottleneck, which reduces the computational cost of the model by optimizing how the bottleneck structures are combined. Then, to further improve accuracy and reduce running time, a homogeneous self-knowledge distillation (HSKD) training method is proposed. Finally, the lip reading model is trained with HSKD and its recognition performance is evaluated. Experimental results show that HSKDLR achieves higher recognition accuracy and lower computational cost than the compared methods: its accuracy on the LRW dataset reaches 87.3%, with as few as 2.564 GFLOPs of floating-point operations and 3.8723×10^7 parameters. Moreover, HSKD can be applied to most lip reading models, effectively improving their recognition accuracy and reducing training time.
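To make the S-SE idea concrete: the abstract does not describe the module's internal structure, so the sketch below is only a minimal illustration that follows the standard spatial squeeze-and-excitation formulation, in which a 1x1 convolution collapses the channel axis into a single spatial attention map. The class name SpatialSE, the channel count, and its placement inside an i-Ghost Bottleneck are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SpatialSE(nn.Module):
    # Spatial squeeze-and-excitation (one plausible reading of S-SE):
    # a 1x1 convolution squeezes all channels into one spatial map, and the
    # sigmoid-activated map reweights every spatial location of the input.
    def __init__(self, channels):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, 1, kernel_size=1, bias=True)

    def forward(self, x):                      # x: (N, C, H, W)
        attn = torch.sigmoid(self.squeeze(x))  # (N, 1, H, W) spatial weights
        return x * attn                        # broadcast over the channel axis

# Example: reweighting a 64-channel feature map of a lip crop (sizes are illustrative only).
# y = SpatialSE(64)(torch.randn(2, 64, 22, 22))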

Key words: lip reading, lightweight, knowledge distillation, self-knowledge, Ghost Bottleneck
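As a rough illustration of the self-knowledge distillation idea behind HSKD (the abstract does not specify what makes the distillation "homogeneous"), the sketch below combines cross-entropy on the hard labels with a temperature-scaled KL term against the model's own predictions from a frozen earlier snapshot of itself. The function names, the choice of a previous-epoch copy as the teacher, and the alpha/temperature values are assumptions for illustration only, not the paper's method.

import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           temperature=3.0, alpha=0.5):
    # Hard-label cross-entropy plus KL divergence to the model's own
    # earlier soft predictions, scaled by T^2 as in standard distillation.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kd

def train_step(model, prev_model, frames, labels, optimizer):
    # prev_model is assumed to be a frozen copy of the same network from an
    # earlier epoch, acting as the model's own ("homogeneous") teacher.
    with torch.no_grad():
        teacher_logits = prev_model(frames)
    loss = self_distillation_loss(model(frames), teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()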