Journal of Frontiers of Computer Science and Technology, 2023, Vol. 17, Issue (11): 2689-2702. DOI: 10.3778/j.issn.1673-9418.2208032
Graphics·Image
MA Jinlin, LIU Yuhao, MA Ziping, GONG Yuanwen, ZHU Yanbin
Online: 2023-11-01
Published: 2023-11-01
MA Jinlin, LIU Yuhao, MA Ziping, GONG Yuanwen, ZHU Yanbin. HSKDLR: Lightweight Lip Reading Method Based on Homogeneous Self-Knowledge Distillation[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(11): 2689-2702.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2208032
[1] LIN Zhenyuan, LIN Shaohui, YAO Yiwu, HE Gaoqi, WANG Changbo, MA Lizhuang. Multi-teacher Contrastive Knowledge Inversion for Data-Free Distillation[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(11): 2721-2733.
[2] MA Ziping, TAN Lidao, MA Jinlin, CHEN Yong. SMViT: Lightweight Siamese Masked Vision Transformer Model for Diagnosis of COVID-19[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(10): 2499-2510.
[3] MA Jinlin, ZHANG Yu, MA Ziping, MAO Kaiji. Research Progress of Lightweight Neural Network Convolution Design[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(3): 512-528.
[4] WANG Yanni, YU Lixian. SSD Object Detection Algorithm with Effective Fusion of Attention and Multi-scale[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 438-447.
[5] SHI Min, SHEN Jialin, YI Qingming, LUO Aiwen. Rapid and Ultra-lightweight Semantic Segmentation in Urban Traffic Scene[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(10): 2377-2386.
[6] WANG Dicong, BAI Chenshuai, WU Kaijun. Survey of Video Object Detection Based on Deep Learning[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(9): 1563-1577.
[7] SU Jiangyi, SONG Xiaoning, WU Xiaojun, YU Dongjun. Skeleton Based Action Recognition Algorithm on Multi-modal Lightweight Graph Convolutional Network[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(4): 733-742.
[8] XIAO Zhenjiu, YANG Xiaodi, WEI Xian, TANG Xiaoliang. Improved Lightweight Network in Image Recognition[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(4): 743-753.
[9] MENG Xianfa, LIU Fang, LI Guang, HUANG Mengmeng. Review of Knowledge Distillation in Convolutional Neural Network Compression[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(10): 1812-1829.
[10] LANG Lei, XIA Yingqing. Survey on Compact Neural Network Model Design[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(9): 1456-1470.
[11] ZHANG Dian, WANG Haitao, JIANG Ying, CHEN Xing. Research on Real-Time Face Recognition Algorithm Based on Lightweight Network[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(2): 317-324.
[12] SUN Ziwen, LI Song. Lightweight Authentication Protocol for Location Privacy Using PUF in Mobile RFID System[J]. Journal of Frontiers of Computer Science and Technology, 2019, 13(3): 418-428.
[13] WENG Hao, JIA Jinyuan. Intelligent Extraction of Tree L-system from a Single Image[J]. Journal of Frontiers of Computer Science and Technology, 2013, 7(2): 145-151.