[1] ZHANG X B, GONG H G, YANG F, et al. Chinese sentence-level lip reading based on end-to-end model[J]. Journal of Software, 2020, 31(6): 1747-1760.
张晓冰, 龚海刚, 杨帆, 等. 基于端到端句子级别的中文唇语识别研究[J]. 软件学报, 2020, 31(6): 1747-1760.
[2] ZHOU Z H, ZHAO G Y, HONG X P, et al. A review of recent advances in visual speech decoding[J]. Image and Vision Computing, 2014, 32(9): 590-605.
[3] MATHULAPRANGSAN S, WANG C Y, KUSUM A Z, et al. A survey of visual lip reading and lip-password verification[C]//Proceedings of the 2015 International Conference on Orange Technologies, Hong Kong, China, Dec 19-22, 2015. Piscataway: IEEE, 2015: 22-25.
[4] WANG M. Lip feature selection based on BPSO and SVM[C]//Proceedings of the 2011 IEEE 10th International Conference on Electronic Measurement & Instruments, Chengdu, Aug 16-19, 2011. Piscataway: IEEE, 2011: 56-59.
[5] LIU H, FAN T, WU P P. Audio-visual keyword spotting based on adaptive decision fusion under noisy conditions for human-robot interaction[C]//Proceedings of the 2014 IEEE International Conference on Robotics and Automation, Hong Kong, China, May 31-Jun 7, 2014. Piscataway: IEEE, 2014: 6644-6651.
[6] MA J L, CHEN D G, GUO B B, et al. Lip corpus review[J]. Computer Engineering and Applications, 2019, 55(22): 1-13.
马金林, 陈德光, 郭贝贝, 等. 唇语语料库综述[J]. 计算机工程与应用, 2019, 55(22): 1-13.
[7] MATTHEWS I, COOTES T F, BANGHAM J A, et al. Extraction of visual features for lipreading[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(2): 198-213.
[8] COOKE M, BARKER J, CUNNINGHAM S, et al. An audio-visual corpus for speech perception and automatic speech recognition[J]. The Journal of the Acoustical Society of America, 2006, 120(5): 2421-2424.
[9] ZHAO G Y, BARNARD M, PIETIKAINEN M, et al. Lipreading with local spatiotemporal descriptors[J]. IEEE Transactions on Multimedia, 2009, 11(7): 1254-1265.
[10] CHUNG J S, ZISSERMAN A. Lip reading in the wild[C]//LNCS 10112: Proceedings of the 13th Asian Conference on Computer Vision, Taipei, China, Nov 20-24, 2016. Cham: Springer, 2016: 87-103.
[11] HUANG J, POTAMIANOS G, CONNELL J, et al. Audio-visual speech recognition using an infrared headset[J]. Speech Communication, 2004, 44: 83-96.
[12] MCCOOL C, LEVY C, MATROUF D, et al. Bi-modal person recognition on a mobile phone: using mobile phone data[C]//Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops, Melbourne, Jul 9-13, 2012. Washington: IEEE Computer Society, 2012: 635-640.
[13] LAN Y X, THEOBALD B J, HARVEY R, et al. View independent computer lip-reading[C]//Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, Melbourne, Jul 9-13, 2012. Washington: IEEE Computer Society, 2012: 432-437.
[14] KUMAR K, CHEN T, STERN R M, et al. Profile view lip reading[C]//Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Apr 15-20, 2007. Piscataway: IEEE, 2007: 429-432.
[15] PATTERSON E K, GURBUZ S, TUFEKCI Z, et al. CUAVE: a new audio-visual database for multimodal human-computer interface research[C]//Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, May 13-17, 2002. Piscataway: IEEE, 2002: 2017-2020.
[16] LAN Y X, THEOBALD B J, HARVEY R W, et al. Improving visual features for lip-reading[C]//Proceedings of the Auditory-Visual Speech Processing, Hakone, Sep 30-Oct 3, 2010.
[17] ESTELLERS V, THIRAN J P. Multi-pose audio-visual speech recognition[C]//Proceedings of the 19th European Signal Processing Conference, Barcelona, Aug 29-Sep 2, 2011. Piscataway: IEEE, 2011: 1065-1069.
[18] ANINA I, ZHOU Z H, ZHAO G Y, et al. OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis[C]//Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, Ljubljana, May 4-8, 2015. Washington: IEEE Computer Society, 2015: 1-5.
[19] CHUNG J S, SENIOR A, VINYALS O, et al. Lip reading sentences in the wild[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3444-3453.
[20] AFOURAS T, CHUNG J S, ZISSERMAN A, et al. LRS3-TED: a large-scale dataset for visual speech recognition[J]. arXiv:1809.00496, 2018.
[21] YANG S, ZHANG Y H, FENG D L, et al. LRW-1000: a naturally-distributed large-scale benchmark for lip reading in the wild[C]//Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition, Lille, May 14-18, 2019. Piscataway: IEEE, 2019: 1-8.
[22] RONG C Z, YUE Z J, JIA Y X, et al. Research advances in key technology of lip-reading[J]. Journal of Data Acquisition and Processing, 2012(S2): 277-283.
荣传振, 岳振军, 贾永兴, 等. 唇语识别关键技术研究进展[J]. 数据采集与处理, 2012(S2): 277-283.
[23] DUPONT S, LUETTIN J. Audio-visual speech modeling for continuous speech recognition[J]. IEEE Transactions on Multimedia, 2000, 2(3): 141-151.
[24] POTAMIANOS G, LUETTIN J, NETI C, et al. Hierarchical discriminant features for audio-visual LVCSR[C]//Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, May 7-11, 2001. Piscataway: IEEE, 2001: 165-168.
[25] MARCHERET E, LIBAL V, POTAMIANOS G, et al. Dynamic stream weight modeling for audio-visual speech recognition[C]//Proceedings of the 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, Honolulu, Apr 15-20, 2007. Piscataway: IEEE, 2007: 945-948.
[26] ALMAJAI I, COX S J, HARVEY R W, et al. Improved speaker independent lip reading using speaker adaptive training and deep neural networks[C]//Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, Mar 20-25, 2016. Piscataway: IEEE, 2016: 2722-2726.
[27] SHAIKH A A, KUMAR D K, YAU W C, et al. Lip reading using optical flow and support vector machines[C]//Proceedings of the 3rd International Congress on Image and Signal Processing, Yantai, Oct 16-18, 2010. Piscataway: IEEE, 2010: 327-330.
[28] OJALA T, PIETIKAINEN M, MAENPAA T, et al. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7): 971-987.
[29] ZHAO G, PIETIKAINEN M. Dynamic texture recognition using local binary patterns with an application to facial expressions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 915-928.
[30] ZHOU Z H, ZHAO G Y, PIETIKAINEN M, et al. Towards a practical lipreading system[C]//Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, Jun 20-25, 2011. Washington: IEEE Computer Society, 2011: 137-144.
[31] REKIK A, BEN-HAMADOU A, MAHDI W, et al. An adaptive approach for lip-reading using image and depth data[J]. Multimedia Tools and Applications, 2016, 75(14): 8609-8636.
[32] LI M, CHEUNG Y M. A novel motion based lip feature extraction for lip-reading[C]//Proceedings of the 2008 International Conference on Computational Intelligence and Security, Suzhou, Dec 13-17, 2008. Washington: IEEE Computer Society, 2008: 361-365.
[33] ALIZADEH S, BOOSTANI R, ASADPOUR V, et al. Lip feature extraction and reduction for HMM-based visual speech recognition systems[C]//Proceedings of the 9th International Conference on Signal Processing, Beijing, Oct 26-29, 2008. Piscataway: IEEE, 2008: 561-564.
[34] MA X J, YAN L, ZHONG Q Y, et al. Lip feature extraction based on improved jumping-snake model[C]//Proceedings of the 2016 35th Chinese Control Conference, Chengdu, Jul 27-29, 2016. Piscataway: IEEE, 2016: 6928-6933.
[35] KASS M, WITKIN A, TERZOPOULOS D, et al. Snakes: active contour models[J]. International Journal of Computer Vision, 1988, 1(4): 321-331.
[36] COOTES T F, TAYLOR C J, COOPER D H, et al. Active shape models-their training and application[J]. Computer Vision and Image Understanding, 1995, 61(1): 38-59.
[37] CHEN J Y, TIDDEMAN B, ZHAO G, et al. Real-time lip contour extraction and tracking using an improved active contour model[C]//LNCS 5359: Proceedings of the 4th International Symposium on Visual Computing, Las Vegas, Dec 1-3, 2008. Berlin, Heidelberg: Springer, 2008: 236-245.
[38] COOTES T F, EDWARDS G J, TAYLOR C J, et al. Active appearance models[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(6): 681-685.
[39] LAN Y X, HARVEY R W, THEOBALD B J, et al. Insights into machine lip reading[C]//Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Mar 25-30, 2012. Piscataway: IEEE, 2012: 4825-4828.
[40] WATANABE T, KATSURADA K, KANAZAWA Y, et al. Lip reading from multi view facial images using 3D-AAM[C]//LNCS 10117: Proceedings of the 13th Asian Conference on Computer Vision, Taipei, China, Nov 20-24, 2016. Cham: Springer, 2016: 303-316.
[41] ALEKSIC P S, KATSAGGELOS A K. Audio-visual biometrics[J]. Proceedings of the IEEE, 2006, 94(11): 2025-2044.
[42] STILLITTANO S, GIRONDEL V, CAPLIER A, et al. Lip contour segmentation and tracking compliant with lip-reading application constraints[J]. Machine Vision and Applications, 2013, 24(1): 1-18.
[43] LUCEY P, POTAMIANOS G, SRIDHARAN S, et al. A unified approach to multi-pose audio-visual ASR[C]//Proceedings of the 8th Annual Conference of the International Speech Communication Association, Antwerp, Aug 27-31, 2007: 650-653.
[44] GURBAN M, THIRAN J P. Information theoretic feature extraction for audio-visual speech recognition[J]. IEEE Transactions on Signal Processing, 2009, 57(12): 4765-4776.
[45] NAVARATHNA R, KLEINSCHMIDT T, DEAN D, et al. Can audio-visual speech recognition outperform acoustically enhanced speech recognition in automotive environment?[C]//Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, Aug 27-31, 2011: 2241-2244.
[46] ESTELLERS V, THIRAN J P. Multi-pose lip reading and audio-visual speech recognition[J]. EURASIP Journal on Advances in Signal Processing, 2012(1): 1-23.
[47] LEE D, LEE J, KIM K E, et al. Multi-view automatic lip-reading using neural network[C]//LNCS 10117: Proceedings of the 13th Asian Conference on Computer Vision, Taipei, China, Nov 20-24, 2016. Cham: Springer, 2016: 290-302.
[48] FERNANDEZ-LOPEZ A, MARTINEZ O, SUKNO F M, et al. Towards estimating the upper bound of visual-speech recognition: the visual lip-reading feasibility database[C]//Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition, Washington, May 30-Jun 3, 2017. Washington: IEEE Computer Society, 2017: 208-215.
[49] GOLDSCHEN A J, GARCIA O N, PETAJAN E, et al. Continuous optical automatic speech recognition by lipreading[C]//Proceedings of the 1994 28th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, Oct 31-Nov 2, 1994. Washington: IEEE Computer Society, 1994: 572-577.
[50] CAPPELLETTA L, HARTE N. Viseme definitions comparison for visual-only speech recognition[C]//Proceedings of the 19th European Signal Processing Conference, Barcelona, Aug 29-Sep 2, 2011. Piscataway: IEEE, 2011: 2109-2113.
[51] WAND M, KOUTNIK J, SCHMIDHUBER J, et al. Lip reading with long short-term memory[C]//Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, Mar 20-25, 2016. Piscataway: IEEE, 2016: 6115-6119.
[52] RAHMANI M H, ALMASGANJ F. Lip-reading via a DNN-HMM hybrid system using combination of the image-based and model-based features[C]//Proceedings of the 2017 3rd International Conference on Pattern Recognition and Image Analysis, Shahrekord, Apr 19-20, 2017. Piscataway: IEEE, 2017: 195-199.
[53] WANG S L, LIEW A W C, LAU W H, et al. An automatic lip reading system for spoken digits with limited training data[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2008, 18(12): 1760-1765.
[54] NODA K, YAMAGUCHI Y, NAKADAI K, et al. Lipreading using convolutional neural network[C]//Proceedings of the 15th Annual Conference of the International Speech Communication Association, Singapore, Sep 14-18, 2014: 1149-1153.
[55] GARG A, NOYOLA J, BAGADIA S. Lip reading using CNN and LSTM[R]. Stanford University, 2016.
[56] NODA K, YAMAGUCHI Y, NAKADAI K, et al. Audio-visual speech recognition using deep learning[J]. Applied Intelligence, 2015, 42(4): 722-737.
[57] ZHOU P, YANG W, CHEN W, et al. Modality attention for end-to-end audio-visual speech recognition[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, May 12-17, 2019. Piscataway: IEEE, 2019: 6565-6569.
[58] SAITOH T, ZHOU Z H, ZHAO G Y, et al. Concatenated frame image based CNN for visual speech recognition[C]//LNCS 10117: Proceedings of the 13th Asian Conference on Computer Vision, Taipei, China, Nov 20-24, 2016. Cham: Springer, 2016: 277-289.
[59] LIN M, CHEN Q, YAN S. Network in network[J]. arXiv:1312.4400, 2013.
[60] MESBAH A, BERRAHOU A, HAMMOUCHI H, et al. Lip reading with Hahn convolutional neural networks[J]. Image and Vision Computing, 2019, 88: 76-83.
[61] ASSAEL Y M, SHILLINGFORD B, WHITESON S, et al. LipNet: sentence-level lipreading[J]. arXiv:1611.01599, 2016.
[62] FUNG I, MAK B. End-to-end low-resource lip-reading with maxout CNN and LSTM[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 2511-2515.
[63] XU K, LI D, CASSIMATIS N, et al. LCANet: end-to-end lip reading with cascaded attention-CTC[C]//Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition, Xi'an, May 15-19, 2018. Piscataway: IEEE, 2018: 548-555.
[64] WENG X, KITANI K. Learning spatio-temporal features with two-stream deep 3D CNNs for lip reading[J]. arXiv:1905.02540, 2019.
[65] WIRIYATHAMMABHUM P. SpotFast networks with memory augmented lateral transformers for lipreading[C]//Proceedings of the 27th International Conference on Neural Information Processing, Bangkok, Nov 18-22, 2020. Cham: Springer, 2020: 554-561.
[66] STAFYLAKIS T, KHAN M H, TZIMIROPOULOS G, et al. Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs[J]. Computer Vision and Image Understanding, 2018, 176: 22-32.
[67] FENG D, YANG S, SHAN S, et al. Learn an effective lip reading model without pains[J]. arXiv:2011.07557, 2020.
[68] AFOURAS T, CHUNG J S, ZISSERMAN A, et al. My lips are concealed: audio-visual speech enhancement through obstructions[J]. arXiv:1907.04975, 2019.
[69] XU B, LU C, GUO Y D, et al. Discriminative multi-modality speech recognition[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 14433-14442.
[70] LUO M, YANG S, SHAN S, et al. Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading[J]. arXiv:2003.03983, 2020.
[71] XIAO J, YANG S, ZHANG Y, et al. Deformation flow based two-stream network for lip reading[J]. arXiv:2003.05709, 2020.
[72] ZHAO X, YANG S, SHAN S, et al. Mutual information maximization for effective lip reading[J]. arXiv:2003.06439, 2020.
[73] PETRIDIS S, WANG Y, LI Z, et al. End-to-end audiovisual fusion with LSTMs[J]. arXiv:1709.04343, 2017.
[74] PETRIDIS S, LI Z, PANTIC M. End-to-end visual speech recognition with LSTMs[C]//Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, Mar 5-9, 2017. Piscataway: IEEE, 2017: 2592-2596.
[75] PETRIDIS S, WANG Y, LI Z, et al. End-to-end multi-view lip reading[J]. arXiv:1709.00443, 2017.
[76] PETRIDIS S, SHEN J, CETIN D, et al. Visual-only recognition of normal, whispered and silent speech[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 6219-6223.
[77] WAND M, SCHMIDHUBER J. Improving speaker-independent lipreading with domain-adversarial training[J]. arXiv:1708.01565, 2017.
[78] WAND M, SCHMIDHUBER J, VU N T, et al. Investigations on end-to-end audiovisual fusion[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 3041-3045.
[79] MOON S, KIM S, WANG H, et al. Multimodal transfer deep learning with applications in audio-visual recognition[J]. arXiv:1412.3121, 2014.
[80] LI Y, TAKASHIMA Y, TAKIGUCHI T, et al. Lip reading using a dynamic feature of lip images and convolutional neural networks[C]//Proceedings of the 2016 IEEE/ACIS 15th International Conference on Computer and Information Science, Okayama, Jun 26-29, 2016. Piscataway: IEEE, 2016: 1-6.
[81] CHUNG J S, ZISSERMAN A. Out of time: automated lip sync in the wild[C]//LNCS 10117: Proceedings of the 2016 Asian Conference on Computer Vision, Taipei, China, Nov 20-24, 2016. Cham: Springer, 2016: 251-263.
[82] GUTIERREZ A, ROBERT Z. Lip reading word classification[R]. Stanford University, 2017.
[83] CHUNG J S, ZISSERMAN A. Learning to lip read words by watching videos[J]. Computer Vision and Image Understanding, 2018, 173: 76-85.
[84] OLIVEIRA D A B, MATTOS A B, MORAIS E, et al. Improving viseme recognition using GAN-based frontal view mapping[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 2148-2155.
[85] NADEEMHASHMI S, GUPTA H, MITTAL D, et al. A lip reading model using CNN with batch normalization[C]//Proceedings of the 2018 11th International Conference on Contemporary Computing, Noida, Aug 2-4, 2018. Piscataway: IEEE, 2018: 1-6.
[86] JHA A, NAMBOODIRI V P, JAWAHAR C V, et al. Word spotting in silent lip videos[C]//Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar 12-15, 2018. Piscataway: IEEE, 2018: 150-159.
[87] MATTOS A B, OLIVEIRA D A B, MORAIS E D S, et al. Improving CNN-based viseme recognition using synthetic data[C]//Proceedings of the 2018 IEEE International Conference on Multimedia and Expo, San Diego, Jul 23-27, 2018. Piscataway: IEEE, 2018: 1-6.
[88] ZHANG X B, GONG H G, DAI X L, et al. Understanding pictograph with facial features: end-to-end sentence-level lip reading of Chinese[C]//Proceedings of the 2019 AAAI Conference on Artificial Intelligence, Honolulu, Jan 27-Feb 1, 2019. Menlo Park: AAAI, 2019: 9211-9218.
[89] ZHAO Y, XU R, WANG X, et al. Hearing lips: improving lip reading by distilling speech recognizers[C]//Proceedings of the 2020 AAAI Conference on Artificial Intelligence, New York, Feb 7-12, 2020. Menlo Park: AAAI, 2020: 6917-6924.
[90] ZHAO Y, XU R, SONG M L. A cascade sequence-to-sequence model for Chinese Mandarin lip reading[C]//Proceedings of the ACM Multimedia Asia, Beijing, Dec 16-18, 2019. New York: ACM, 2019: 1-6.
[91] TORFI A, IRANMANESH S M, NASRABADI N, et al. 3D convolutional neural networks for cross audio-visual matching recognition[J]. IEEE Access, 2017, 5: 22081-22091.
[92] SHILLINGFORD B, ASSAEL Y, HOFFMAN M W, et al. Large-scale visual speech recognition[J]. arXiv:1807.05162, 2018.
[93] KUMAR Y, JAIN R, SALIK K M, et al. Lipper: synthesizing thy speech using multi-view lipreading[C]//Proceedings of the 2019 AAAI Conference on Artificial Intelligence, Honolulu, Jan 27-Feb 1, 2019. Menlo Park: AAAI, 2019: 2588-2595.
[94] LIU J L, REN Y, ZHAO Z, et al. FastLR: non-autoregressive lipreading model with integrate-and-fire[C]//Proceedings of the 28th ACM International Conference on Multimedia, Seattle, Oct 12-16, 2020. New York: ACM, 2020: 4328-4336.
[95] STAFYLAKIS T, TZIMIROPOULOS G. Combining residual networks with LSTMs for lipreading[J]. arXiv:1703.04105, 2017.
[96] STAFYLAKIS T, TZIMIROPOULOS G. Deep word embeddings for visual speech recognition[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 4974-4978.
[97] PETRIDIS S, STAFYLAKIS T, MA P, et al. End-to-end audio-visual speech recognition[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 6548-6552.
[98] PETRIDIS S, STAFYLAKIS T, MA P, et al. Audio-visual speech recognition with a hybrid CTC/attention architecture[C]//Proceedings of the 2018 IEEE Spoken Language Technology Workshop, Athens, Dec 18-21, 2018. Piscataway: IEEE, 2018: 513-520.
[99] AFOURAS T, CHUNG J S, ZISSERMAN A. Deep lip reading: a comparison of models and an online application[J]. arXiv:1806.06053, 2018.
[100] AFOURAS T, CHUNG J S, SENIOR A, et al. Deep audio-visual speech recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018: 1-5.
[101] STERPU G, SAAM C, HARTE N, et al. Attention-based audio-visual fusion for robust automatic speech recognition[C]//Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, Oct 16-20, 2018. New York: ACM, 2018: 111-115.
[102] MARGAM D K, ARALIKATTI R, SHARMA T, et al. Lip reading with 3D-2D-CNN BLSTM-HMM and word-CTC models[J]. arXiv:1906.12170, 2019.
[103] ZHANG S, LEI M, MA B, et al. Robust audio-visual speech recognition using bimodal DFSMN with multi-condition training and dropout regularization[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, May 12-17, 2019. Piscataway: IEEE, 2019: 6570-6574.
[104] AFOURAS T, CHUNG J S, ZISSERMAN A. ASR is all you need: cross-modal distillation for lip reading[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 2143-2147.
[105] WANG C. Multi-grained spatio-temporal modeling for lip-reading[J]. arXiv:1908.11618, 2019.
[106] ZHANG X X, FENG C, WANG S L, et al. Spatio-temporal fusion based convolutional sequence learning for lip reading[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 713-722.
[107] ZHANG Y, YANG S, XIAO J, et al. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition[J]. arXiv:2003.03206, 2020.
[108] MARTINEZ B, MA P C, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6319-6323.
[109] FENGHOUR S, CHEN D, GUO K, et al. Lip reading sentences using deep learning with only visual cues[J]. IEEE Access, 2020, 8: 215516-215530.
[110] STERPU G, SAAM C, HARTE N, et al. Should we hard-code the recurrence concept or learn it instead? Exploring the transformer architecture for audio-visual speech recognition[J]. arXiv:2005.09297, 2020.
[111] MA P C, MARTINEZ B, PETRIDIS S, et al. Towards practical lip reading with distilled and efficient models[C]//Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Jun 6-11, 2021. Piscataway: IEEE, 2021: 7608-7612.
[112] MA P C, PETRIDIS S, PANTIC M, et al. End-to-end audio-visual speech recognition with conformers[C]//Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Jun 6-11, 2021. Piscataway: IEEE, 2021: 7613-7617.
[113] LIU M, WANG L, LEE K A, et al. Exploring deep learning for joint audio-visual lip biometrics[J]. arXiv:2104.08510, 2021.
[114] NINOMIYA H, KITAOKA N, TAMURA S, et al. Integration of deep bottleneck features for audio-visual speech recognition[C]//Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Sep 6-10, 2015: 563-567.