[1] BENGIO Y. Markovian models for sequential data[J]. Neural Computing Surveys, 1999, 2: 129-162.
[2] WANG D, WANG X, LV S. An overview of end-to-end automatic speech recognition[J]. Symmetry, 2019, 11(8): 1-27.
[3] YU D, DENG L. Deep learning and its applications to signal and information processing[J]. IEEE Signal Processing Magazine, 2011, 28(1): 145-154.
[4] WANG H K, PAN J, LIU C, et al. Research development and forecast of automatic speech recognition technologies[J]. Telecommunications Science, 2018, 34(2): 7-17.
王海坤, 潘嘉, 刘聪, 等. 语音识别技术的研究进展与展望[J]. 电信科学, 2018, 34(2): 7-17.
[5] HOU Y M, ZHOU H Q, WANG Z Y. Overview of speech recognition based on deep learning[J]. Application Research of Computers, 2017, 34(8): 2241-2246.
侯一民, 周慧琼, 王政一. 深度学习在语音识别中的研究进展综述[J]. 计算机应用研究, 2017, 34(8): 2241-2246.
[6] GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Jun 25-29, 2006. New York: ACM, 2006: 369-376.
[7] GRAVES A. Sequence transduction with recurrent neural networks[J]. arXiv:1211.3711, 2012.
[8] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30: 5998-6008.
[9] DONG L, XU S, XU B. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 5884-5888.
[10] BIE A, VENKITESH B, MONTEIRO J, et al. A simplified fully quantized transformer for end-to-end speech recognition[J]. arXiv:1911.03604, 2019.
[11] GRAVES A. Generating sequences with recurrent neural networks[J]. arXiv:1308.0850, 2013.
[12] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Piscataway: IEEE, 2016: 770-778.
[13] BA J L, KIROS J R, HINTON G E. Layer normalization[J]. arXiv:1607.06450, 2016.
[14] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[C]//Proceedings of the 3rd International Conference on Learning Representations, San Diego, May 7-9, 2015.
[15] SAINATH T N, MOHAMED A, KINGSBURY B, et al. Deep convolutional neural networks for LVCSR[C]//Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, May 26-31, 2013: 8614-8618.
[16] TSOI A C. Recurrent neural network architectures: an overview[C]//LNCS 1387: Adaptive Processing of Sequences & Data Structures. Berlin, Heidelberg: Springer, 1998: 1-26.
[17] NGUYEN B, NGUYEN V B H, NGUYEN H, et al. Fast and accurate capitalization and punctuation for automatic speech recognition using transformer and chunk merging[C]//Proceedings of the 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques, Philippines, Oct 25-27, 2019: 1-5.
[18] SHI Y, WANG Y, WU C, et al. Weak-attention suppression for transformer based speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 4996-5000.
[19] LI W, QIN J, CHIU C C, et al. Parallel rescoring with transformer for streaming on-device speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 2122-2126.
[20] MOHAMED A, OKHONKO D, ZETTLEMOYER L. Transformers with convolutional context for ASR[J]. arXiv:1904.11660, 2019.
[21] LU L, LIU C, LI J, et al. Exploring transformers for large-scale speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 5041-5045.
[22] WENG S Y, CHEN B. Effective decoder masking for transformer based end-to-end speech recognition[J]. arXiv:2010.14764, 2020.
[23] CHEN X, ZHANG S, SONG D, et al. Transformer with bidirectional decoder for speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 1773-1777.
[24] ZHOU P, FAN R, CHEN W, et al. Improving generalization of transformer for speech recognition with parallel schedule sampling and relative positional embedding[J]. arXiv:1911.00203, 2019.
[25] POVEY D, HADIAN H, GHAHREMANI P, et al. A time-restricted self-attention layer for ASR[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 5874-5878.
[26] SPERBER M, NIEHUES J, NEUBIG G, et al. Self-attentional acoustic models[C]//Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, Sep 2-6, 2018: 3723-3727.
[27] ZHAO Y Y, LI J, WANG X R, et al. The speech-transformer for large-scale Mandarin Chinese speech recognition[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, May 12-17, 2019. Piscataway: IEEE, 2019: 7095-7099.
[28] TSUNOO E, KASHIWAGI Y, KUMAKURA T, et al. Towards online end-to-end transformer automatic speech recognition[J]. arXiv:1910.11871, 2019.
[29] YEH C F, MAHADEOKAR J, KALGAONKAR K, et al. Transformer-transducer: end-to-end speech recognition with self-attention[J]. arXiv:1910.12977, 2019.
[30] MORITZ N, HORI T, LE ROUX J. Streaming automatic speech recognition with the transformer model[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6074-6078.
[31] PHAM N Q, NGUYEN T S, NIEHUES J, et al. Very deep self-attention networks for end-to-end speech recognition[C]//Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Sep 15-19, 2019: 66-70.
[32] TSUNOO E, KASHIWAGI Y, KUMAKURA T, et al. Transformer ASR with contextual block processing[C]//Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop, Singapore, Dec 14-18, 2019. Piscataway: IEEE, 2019: 427-433.
[33] SALAZAR J, KIRCHHOFF K, HUANG Z H. Self-attention networks for connectionist temporal classification in speech recognition[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, May 12-17, 2019. Piscataway: IEEE, 2019: 7115-7119.
[34] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Jun 2-7, 2019. Stroudsburg: ACL, 2019: 4171-4186.
[35] ZHOU S Y, DONG L H, XU S, et al. Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese[C]//Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, Sep 2-6, 2018: 791-795.
[36] XU M, LI S, ZHANG X L. Transformer-based end-to-end speech recognition with local dense synthesizer attention[J]. arXiv:2010.12155, 2020.
[37] WANG Y, MOHAMED A, LE D, et al. Transformer-based acoustic modeling for hybrid speech recognition[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6874-6878.
[38] BENGIO Y, SIMARD P, FRASCONI P. Learning long-term dependencies with gradient descent is difficult[J]. IEEE Transactions on Neural Networks, 1994, 5(2): 157-166.
[39] TJANDRA A, LIU C, ZHANG F, et al. DEJA-VU: double feature presentation and iterated loss in deep transformer networks[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6899-6903.
[40] HENDRYCKS D, GIMPEL K. Gaussian error linear units (GELUs)[J]. arXiv:1606.08415, 2016.
[41] WANG C, WU Y, DU Y, et al. Semantic mask for transformer based end-to-end speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 971-975.
[42] HRINCHUK O, POPOVA M, GINSBURG B. Correction of automatic speech recognition with transformer sequence-to-sequence model[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 7074-7078.
[43] XUE B, YU J, XU J, et al. Bayesian transformer language models for speech recognition[J]. arXiv:2102.04754, 2021.
[44] WINATA G I, CAHYAWIJAYA S, LIN Z, et al. Lightweight and efficient end-to-end speech recognition using low-rank transformer[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6144-6148.
[45] HUANG H, PENG F. An empirical study of efficient ASR rescoring with transformers[J]. arXiv:1910.11450, 2019.
[46] ZHANG S, LOWEIMI E, BELL P, et al. On the usefulness of self-attention for automatic speech recognition with transformers[J]. arXiv:2011.04906, 2020.
[47] WANG P, WANG D L. Efficient end-to-end speech recognition using performers in conformers[J]. arXiv:2011.04196, 2020.
[48] ZHANG S, LOWEIMI E, BELL P, et al. Stochastic attention head removal: a simple and effective method for improving automatic speech recognition with transformers[J]. arXiv:2011.04004, 2020.
[49] LUO H, ZHANG S, LEI M, et al. Simplified self-attention for transformer-based end-to-end speech recognition[J]. arXiv:2005.10463, 2020.
[50] WU C Y, WANG Y Q, SHI Y Y, et al. Streaming transformer-based acoustic models using self-attention with augmented memory[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 2132-2136.
[51] YEH C F, WANG Y, SHI Y, et al. Streaming attention-based models with augmented memory for end-to-end speech recognition[C]//Proceedings of the 2021 IEEE Spoken Language Technology Workshop, Shenzhen, Jan 19-22, 2021. Piscataway: IEEE, 2021: 8-14.
[52] HIGUCHI Y, WATANABE S, CHEN N, et al. Mask CTC: non-autoregressive end-to-end ASR with CTC and mask predict[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 3655-3659.
[53] MIAO H, CHENG G, GAO C, et al. Transformer-based online CTC/attention end-to-end speech recognition architecture[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6084-6088.
[54] TIAN Z, YI J, TAO J, et al. Spike-triggered non-autoregressive transformer for end-to-end speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 5026-5030.
[55] FAN R, CHU W, CHANG P, et al. CASS-NAT: CTC alignment-based single step non-autoregressive transformer for speech recognition[J]. arXiv:2010.14725, 2020.
[56] INAGUMA H, HIGUCHI Y, DUH K, et al. Orthros: non-autoregressive end-to-end speech translation with dual-decoder[J]. arXiv:2010.13047, 2020.
[57] CHI E A, SALAZAR J, KIRCHHOFF K. Align-Refine: non-autoregressive speech recognition via iterative realignment[J]. arXiv:2010.14233, 2020.
[58] ZHANG S L, LEI M, YAN Z J. Investigation of transformer based spelling correction model for CTC-based end-to-end Mandarin speech recognition[C]//Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Sep 15-19, 2019: 2180-2184.
[59] HIGUCHI Y, INAGUMA H, WATANABE S, et al. Improved mask-CTC for non-autoregressive end-to-end ASR[J]. arXiv:2010.13270, 2020.
[60] FUJITA Y, WATANABE S, OMACHI M, et al. Insertion-based modeling for end-to-end automatic speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 3660-3664.
[61] REN Y, LIU J, TAN X, et al. A study of non-autoregressive model for sequence generation[J]. arXiv:2004.10454, 2020.
[62] BAI Y, YI J, TAO J, et al. Listen attentively, and spell once: whole sentence generation via a non-autoregressive architecture for low-latency speech recognition[J]. arXiv:2005.04862, 2020.
[63] SONG X, WU Z, HUANG Y, et al. Non-autoregressive transformer ASR with CTC-enhanced decoder input[J]. arXiv:2010.15025, 2020.
[64] CHEN X, WU Y, WANG Z, et al. Developing real-time streaming transformer transducer for speech recognition on large-scale dataset[J]. arXiv:2010.11395, 2020.
[65] TRIPATHI A, KIM J, ZHANG Q, et al. Transformer transducer: one model unifying streaming and non-streaming speech recognition[J]. arXiv:2010.03192, 2020.
[66] ZHOU S, XU S, XU B. Multilingual end-to-end speech recognition with a single transformer on low-resource languages[J]. arXiv:1806.05059, 2018.
[67] LE H, PINO J, WANG C, et al. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation[C]//Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Dec 8-13, 2020: 3520-3533.