[1] 姚鸿勋, 高文, 王瑞, 等. 视觉语言——唇读综述[J]. 电子学报, 2001, 29(2): 239-246.
YAO H X, GAO W, WANG R, et al. A survey of lipreading—one of visual languages[J]. Acta Electronica Sinica, 2001, 29(2): 239-246.
[2] TAMURA S, NINOMIYA H, KITAOKA N, et al. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading[C]//Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Hong Kong, China, Dec 16-19, 2015. Piscataway: IEEE, 2015: 575-582.
[3] WATANABE T, KATSURADA K, KANAZAWA Y. Lip reading from multi view facial images using 3D-AAM[C]//LNCS 10117: Proceedings of the 13th Asian Conference on Computer Vision, Taipei, China, Nov 20-24, 2016. Cham: Springer, 2017: 303-316.
[4] BAART M, SAMUEL A G. Turning a blind eye to the lexicon: ERPs show no cross-talk between lip-read and lexical context during speech sound processing[J]. Journal of Memory & Language, 2015, 85: 42-59.
[5] LESANI F S, GHAZVINI F F, DIANAT R. Mobile phone security using automatic lip reading[C]//Proceedings of the 2015 International Conference on E-commerce in Developing Countries: with Focus on E-business, Isfahan, Apr 16, 2015. Piscataway: IEEE, 2015.
[6] MATHULAPRANGSAN S, WANG C Y, KUSUM A Z, et al. A survey of visual lip reading and lip-password verification[C]//Proceedings of the 2015 International Conference on Orange Technologies, Hong Kong, China, Dec 19-22, 2015. Piscataway: IEEE, 2015: 22-25.
[7] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]//Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, Mar 20-25, 2016. Piscataway: IEEE, 2016: 4945-4949.
[8] HUANG J T, LI J Y, GONG Y F. An analysis of convolutional neural networks for speech recognition[C]//Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, Apr 19-24, 2015. Piscataway: IEEE, 2015: 4989-4993.
[9] CHAE H, KANG C M, KIM B D, et al. Autonomous braking system via deep reinforcement learning[C]//Proceedings of the 20th IEEE International Conference on Intelligent Transportation Systems, Yokohama, Oct 16-19, 2017. Piscataway: IEEE, 2017: 6.
[10] PUVIARASAN N, PALANIVEL S. Lip reading of hearing impaired persons using HMM[J]. Expert Systems with Applications, 2011, 38(4): 4477-4481.
[11] HONG X P, YAO H X, WAN Y Q, et al. A PCA based visual DCT feature extraction method for lip-reading[C]//Proceedings of the 2nd International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Pasadena, Dec 18-20, 2006. Washington: IEEE Computer Society, 2006: 321-326.
[12] 马金林, 朱艳彬, 马自萍, 等. 唇语识别的深度学习方法综述[J]. 计算机工程与应用, 2021, 57(24): 61-73.
MA J L, ZHU Y B, MA Z P, et al. Review of deep learning methods for lip recognition[J]. Computer Engineering and Applications, 2021, 57(24): 61-73.
[13] 马金林, 陈德光, 郭贝贝, 等. 唇语语料库综述[J]. 计算机工程与应用, 2019, 55(22): 1-13.
MA J L, CHEN D G, GUO B B, et al. Lip corpus review[J]. Computer Engineering and Applications, 2019, 55(22): 1-13.
[14] STAFYLAKIS T, TZIMIROPOULOS G. Combining residual networks with LSTMs for lipreading[C]//Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Aug 20-24, 2017: 3652-3656.
[15] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778.
[16] ASSAEL Y M, SHILLINGFORD B, WHITESON S, et al. LipNet: end-to-end sentence-level lipreading[J]. arXiv:1611.01599, 2016.
[17] WENG X, KITANI K. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading[C]//Proceedings of the 30th British Machine Vision Conference, Cardiff, Sep 9-12, 2019. Durham: BMVA Press, 2019.
[18] ZHANG Y H, YANG S, XIAO J, et al. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition[C]//Proceedings of the 2020 IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Nov 16-20, 2020. Piscataway: IEEE, 2020: 356-363.
[19] ZHAO X, YANG S, SHAN S, et al. Mutual information maximization for effective lip reading[C]//Proceedings of the 2020 IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Nov 16-20, 2020. Piscataway: IEEE, 2020: 420-427.
[20] MARTINEZ B, MA P, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6319-6323.
[21] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[J]. arXiv:1503.02531, 2015.
[22] LIU Y, SHU C, WANG J, et al. Structured knowledge distillation for dense prediction[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 7035-7049.
[23] YANG Z, LI Z, JIANG X, et al. Focal and global knowledge distillation for detectors[J]. arXiv:2111.11837, 2021.
[24] 张宸嘉, 朱磊, 俞璐. 卷积神经网络中的注意力机制综述[J]. 计算机工程与应用, 2021, 57(20): 64-72.
ZHANG C J, ZHU L, YU L. Review of attention mechanism in convolutional neural networks[J]. Computer Engineering and Applications, 2021, 57(20): 64-72.
[25] HU J, SHEN L, SUN G, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011-2023.
[26] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]//LNCS 11211: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 3-19.
[27] HU J, SHEN L, ALBANIE S, et al. Gather-Excite: exploiting feature context in convolutional neural networks[C]//Proceedings of the 32nd Conference on Neural Information Processing Systems, Montréal, Dec 3-8, 2018. Red Hook: Curran Associates, 2018: 9423-9433.
[28] LINSLEY D, SHIEBLER D, EBERHARDT S, et al. Learning what and where to attend[C]//Proceedings of the 7th International Conference on Learning Representations, New Orleans, May 6-9, 2019: 1-21.
[29] BELLO I, ZOPH B, LE Q, et al. Attention augmented convolutional networks[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 3285-3294.
[30] MISRA D, NALAMADA T, ARASANIPALAI A U, et al. Rotate to attend: convolutional triplet attention module[C]//Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, Jan 3-8, 2021. Piscataway: IEEE, 2021: 3138-3147.
[31] HAN K, WANG Y, TIAN Q, et al. GhostNet: more features from cheap operations[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 1577-1586.
[32] MOBAHI H, FARAJTABAR M, BARTLETT P L. Self-distillation amplifies regularization in Hilbert space[J]. arXiv:2002.05715, 2020.
[33] ZHANG Z L, SABUNCU M R. Self-distillation as instance-specific label smoothing[J]. arXiv:2006.05065, 2020.
[34] WANG Q L, WU B G, ZHU P F, et al. ECA-Net: efficient channel attention for deep convolutional neural networks[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 11531-11539.
[35] YUAN L, TAY F E, LI G, et al. Revisiting knowledge distillation via label smoothing regularization[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 3902-3910.
[36] MÜLLER R, KORNBLITH S, HINTON G. When does label smoothing help?[C]//Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Dec 8-14, 2019: 4696-4705.
[37] CHUNG J S, SENIOR A, VINYALS O, et al. Lip reading sentences in the wild[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Piscataway: IEEE, 2017: 3444-3453.
[38] KING D E. Dlib-ml: a machine learning toolkit[J]. Journal of Machine Learning Research, 2009, 10(3): 1755-1758.
[39] ZHANG H, CISSE M, DAUPHIN Y N, et al. Mixup: beyond empirical risk minimization[J]. arXiv:1710.09412, 2017.
[40] SANDLER M, HOWARD A, ZHU M, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-23, 2018. Piscataway: IEEE, 2018: 4510-4520.
[41] MA N N, ZHANG X Y, ZHENG H T, et al. ShuffleNet V2: practical guidelines for efficient CNN architecture design[C]//LNCS 11218: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 122-138.
[42] STAFYLAKIS T, KHAN M H, TZIMIROPOULOS G. Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs[J]. Computer Vision & Image Understanding, 2018, 176/177: 22-32.
[43] PETRIDIS S, STAFYLAKIS T, MA P, et al. Audio-visual speech recognition with a hybrid CTC/attention architecture[C]//Proceedings of the 2018 IEEE Spoken Language Technology Workshop, Athens, Dec 18-21, 2018. Piscataway: IEEE, 2018: 513-520.
[44] KIM M, HONG J, PARK S J, et al. Multi-modality associative bridging through memory: speech sound recollected from face video[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Oct 11-17, 2021. Piscataway: IEEE, 2021: 296-306.
[45] CHUNG J S, ZISSERMAN A. Lip reading in the wild[C]//LNCS 10112: Proceedings of the 13th Asian Conference on Computer Vision, Taipei, China, Nov 20-24, 2016. Cham: Springer, 2017: 87-103.
[46] WANG C H. Multi-grained spatio-temporal modeling for lip-reading[C]//Proceedings of the 30th British Machine Vision Conference, Cardiff, Sep 9-12, 2019. Durham: BMVA Press, 2019: 276.
[47] XU B, LU C, GUO Y, et al. Discriminative multi-modality speech recognition[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 7168-7181.
[48] WIRIYATHAMMABHUM P. SpotFast networks with memory augmented lateral transformers for lipreading[C]//Proceedings of the 27th International Conference on Neural Information Processing, Bangkok, Nov 18-22, 2020: 554-561.
[49] PAN X, CHEN P, GONG Y, et al. Leveraging unimodal self-supervised learning for multimodal audio-visual speech recognition[J]. arXiv:2203.07996, 2022.