Review of Extracting Methods for Lip Visual Features

doi:10.3778/j.issn.1673-9418.2106105

Abstract

Existing research on lip reading (lip recognition) mostly focuses on improving recognition accuracy and on multimodal input features, and pays little attention to making lip visual features more effective. Yet lip visual information plays a key role in visual speech recognition and lip reading, and it matters most when the audio is corrupted or unavailable. Obtaining accurate and effective lip visual features remains one of the difficult problems in lip reading. This paper reviews recent work on lip reading from three aspects: lip-reading datasets, traditional visual feature extraction methods, and deep learning methods for visual feature extraction. First, it summarizes lip-reading datasets, divides them into frontal-view and multi-view datasets, and collects the characteristics, limitations, and download addresses of both types. Second, it introduces traditional lip visual feature extraction methods from the perspective of pixel-based, shape-based, and hybrid features, focusing on the basic idea, network structure, and characteristics of each method. Third, it introduces deep learning methods for lip visual feature extraction, focusing on the network structures, advantages, and disadvantages of four families of methods (2D CNN, 3D CNN, combined 2D and 3D CNN, and other neural networks), and compares their performance on public datasets. Finally, it discusses the challenges facing lip visual feature extraction methods and likely future research trends.

Key words: lip recognition, visual feature, deep learning

MA Jinlin, GONG Yuanwen, MA Ziping, CHEN Deguang, ZHU Yanbin, LIU Yuhao. Review of Extracting Methods for Lip Visual Features[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(12): 2256-2275.
