[1] BENGIO Y. Markovian models for sequential data[J]. Neural Computing Surveys, 1999, 2: 129-162.
[2] WANG D, WANG X, LV S. An overview of end-to-end automatic speech recognition[J]. Symmetry, 2019, 11(8): 1-27.
[3] YU D, DENG L. Deep learning and its applications to signal and information processing[J]. IEEE Signal Processing Magazine, 2011, 28(1): 145-154.
[4] WANG H K, PAN J, LIU C, et al. Research development and forecast of automatic speech recognition technologies[J]. Telecommunications Science, 2018, 34(2): 7-17.
王海坤, 潘嘉, 刘聪, 等. 语音识别技术的研究进展与展望[J]. 电信科学, 2018, 34(2): 7-17.
[5] HOU Y M, ZHOU H Q, WANG Z Y. Overview of speech recognition based on deep learning[J]. Application Research of Computers, 2017, 34(8): 2241-2246.
侯一民, 周慧琼, 王政一. 深度学习在语音识别中的研究进展综述[J]. 计算机应用研究, 2017, 34(8): 2241-2246.
[6] GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Jun 25-29, 2006. New York: ACM, 2006: 369-376.
[7] GRAVES A. Sequence transduction with recurrent neural networks[J]. arXiv:1211.3711, 2012.
[8] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30: 5998-6008.
[9] DONG L, XU S, XU B. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 5884-5888.
[10] BIE A, VENKITESH B, MONTEIRO J, et al. A simplified fully quantized transformer for end-to-end speech recognition[J]. arXiv:1911.03604, 2019.
[11] GRAVES A. Generating sequences with recurrent neural networks[J]. arXiv:1308.0850, 2013.
[12] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Piscataway: IEEE, 2016: 770-778.
[13] BA J L, KIROS J R, HINTON G E. Layer normalization[J]. arXiv:1607.06450, 2016.
[14] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[C]//Proceedings of the 3rd International Conference on Learning Representations, San Diego, May 7-9, 2015.
[15] SAINATH T N, MOHAMED A, KINGSBURY B, et al. Deep convolutional neural networks for LVCSR[C]//Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, May 26-31, 2013: 8614-8618.
[16] TSOI A C. Recurrent neural network architectures: an overview[C]//LNCS 1387: Adaptive Processing of Sequences & Data Structures. Berlin, Heidelberg: Springer, 1998: 1-26.
[17] NGUYEN B, NGUYEN V B H, NGUYEN H, et al. Fast and accurate capitalization and punctuation for automatic speech recognition using transformer and chunk merging[C]//Proceedings of the 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques, Philippines, Oct 25-27, 2019: 1-5.
[18] SHI Y, WANG Y, WU C, et al. Weak-attention suppression for transformer based speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 4996-5000.
[19] LI W, QIN J, CHIU C C, et al. Parallel rescoring with transformer for streaming on-device speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 2122-2126.
[20] MOHAMED A, OKHONKO D, ZETTLEMOYER L. Transformers with convolutional context for ASR[J]. arXiv:1904.11660, 2019.
[21] LU L, LIU C, LI J, et al. Exploring transformers for large-scale speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 5041-5045.
[22] WENG S Y, CHEN B. Effective decoder masking for transformer based end-to-end speech recognition[J]. arXiv:2010.14764, 2020.
[23] CHEN X, ZHANG S, SONG D, et al. Transformer with bidirectional decoder for speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 1773-1777.
[24] ZHOU P, FAN R, CHEN W, et al. Improving generalization of transformer for speech recognition with parallel schedule sampling and relative positional embedding[J]. arXiv:1911.00203, 2019.
[25] POVEY D, HADIAN H, GHAHREMANI P, et al. A time-restricted self-attention layer for ASR[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 5874-5878.
[26] SPERBER M, NIEHUES J, NEUBIG G, et al. Self-attentional acoustic models[C]//Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, Sep 2-6, 2018: 3723-3727.
[27] ZHAO Y Y, LI J, WANG X R, et al. The speech-transformer for large-scale Mandarin Chinese speech recognition[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, May 12-17, 2019. Piscataway: IEEE, 2019: 7095-7099.
[28] TSUNOO E, KASHIWAGI Y, KUMAKURA T, et al. Towards online end-to-end transformer automatic speech recognition[J]. arXiv:1910.11871, 2019.
[29] YEH C F, MAHADEOKAR J, KALGAONKAR K, et al. Transformer-transducer: end-to-end speech recognition with self-attention[J]. arXiv:1910.12977, 2019.
[30] MORITZ N, HORI T, LE ROUX J. Streaming automatic speech recognition with the transformer model[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6074-6078.
[31] PHAM N Q, NGUYEN T S, NIEHUES J, et al. Very deep self-attention networks for end-to-end speech recognition[C]//Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Sep 15-19, 2019: 66-70.
[32] TSUNOO E, KASHIWAGI Y, KUMAKURA T, et al. Transformer ASR with contextual block processing[C]//Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop, Singapore, Dec 14-18, 2019. Piscataway: IEEE, 2019: 427-433.
[33] SALAZAR J, KIRCHHOFF K, HUANG Z H. Self-attention networks for connectionist temporal classification in speech recognition[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, May 12-17, 2019. Piscataway: IEEE, 2019: 7115-7119.
[34] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Jun 2-7, 2019. Stroudsburg: ACL, 2019: 4171-4186.
[35] ZHOU S Y, DONG L H, XU S, et al. Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese[C]//Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, Sep 2-6, 2018: 791-795.
[36] XU M, LI S, ZHANG X L. Transformer-based end-to-end speech recognition with local dense synthesizer attention[J]. arXiv:2010.12155, 2020.
[37] WANG Y, MOHAMED A, LE D, et al. Transformer-based acoustic modeling for hybrid speech recognition[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6874-6878.
[38] BENGIO Y, SIMARD P, FRASCONI P. Learning long-term dependencies with gradient descent is difficult[J]. IEEE Transactions on Neural Networks, 1994, 5(2): 157-166.
[39] TJANDRA A, LIU C, ZHANG F, et al. DEJA-VU: double feature presentation and iterated loss in deep transformer networks[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6899-6903.
[40] HENDRYCKS D, GIMPEL K. Gaussian error linear units (GELUs)[J]. arXiv:1606.08415, 2016.
[41] WANG C, WU Y, DU Y, et al. Semantic mask for transformer based end-to-end speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 971-975.
[42] HRINCHUK O, POPOVA M, GINSBURG B. Correction of automatic speech recognition with transformer sequence-to-sequence model[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 7074-7078.
[43] XUE B, YU J, XU J, et al. Bayesian transformer language models for speech recognition[J]. arXiv:2102.04754, 2021.
[44] WINATA G I, CAHYAWIJAYA S, LIN Z, et al. Lightweight and efficient end-to-end speech recognition using low-rank transformer[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6144-6148.
[45] HUANG H, PENG F. An empirical study of efficient ASR rescoring with transformers[J]. arXiv:1910.11450, 2019.
[46] ZHANG S, LOWEIMI E, BELL P, et al. On the usefulness of self-attention for automatic speech recognition with transformers[J]. arXiv:2011.04906, 2020.
[47] WANG P, WANG D L. Efficient end-to-end speech recognition using performers in conformers[J]. arXiv:2011.04196, 2020.
[48] ZHANG S, LOWEIMI E, BELL P, et al. Stochastic attention head removal: a simple and effective method for improving automatic speech recognition with transformers[J]. arXiv:2011.04004, 2020.
[49] LUO H, ZHANG S, LEI M, et al. Simplified self-attention for transformer-based end-to-end speech recognition[J]. arXiv:2005.10463, 2020.
[50] WU C Y, WANG Y Q, SHI Y Y, et al. Streaming transformer-based acoustic models using self-attention with augmented memory[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 2132-2136.
[51] YEH C F, WANG Y, SHI Y, et al. Streaming attention-based models with augmented memory for end-to-end speech recognition[C]//Proceedings of the 2021 IEEE Spoken Language Technology Workshop, Shenzhen, Jan 19-22, 2021. Piscataway: IEEE, 2021: 8-14.
[52] HIGUCHI Y, WATANABE S, CHEN N, et al. Mask CTC: non-autoregressive end-to-end ASR with CTC and mask predict[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 3655-3659.
[53] MIAO H, CHENG G, GAO C, et al. Transformer-based online CTC/attention end-to-end speech recognition architecture[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 6084-6088.
[54] TIAN Z, YI J, TAO J, et al. Spike-triggered non-autoregressive transformer for end-to-end speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 5026-5030.
[55] FAN R, CHU W, CHANG P, et al. CASS-NAT: CTC alignment-based single step non-autoregressive transformer for speech recognition[J]. arXiv:2010.14725, 2020.
[56] INAGUMA H, HIGUCHI Y, DUH K, et al. Orthros: non-autoregressive end-to-end speech translation with dual-decoder[J]. arXiv:2010.13047, 2020.
[57] CHI E A, SALAZAR J, KIRCHHOFF K. Align-Refine: non-autoregressive speech recognition via iterative realignment[J]. arXiv:2010.14233, 2020.
[58] ZHANG S L, LEI M, YAN Z J. Investigation of transformer based spelling correction model for CTC-based end-to-end Mandarin speech recognition[C]//Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Sep 15-19, 2019: 2180-2184.
[59] HIGUCHI Y, INAGUMA H, WATANABE S, et al. Improved mask-CTC for non-autoregressive end-to-end ASR[J]. arXiv:2010.13270, 2020.
[60] FUJITA Y, WATANABE S, OMACHI M, et al. Insertion-based modeling for end-to-end automatic speech recognition[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 3660-3664.
[61] REN Y, LIU J, TAN X, et al. A study of non-autoregressive model for sequence generation[J]. arXiv:2004.10454, 2020.
[62] BAI Y, YI J, TAO J, et al. Listen attentively, and spell once: whole sentence generation via a non-autoregressive architecture for low-latency speech recognition[J]. arXiv:2005.04862, 2020.
[63] SONG X, WU Z, HUANG Y, et al. Non-autoregressive transformer ASR with CTC-enhanced decoder input[J]. arXiv:2010.15025, 2020.
[64] CHEN X, WU Y, WANG Z, et al. Developing real-time streaming transformer transducer for speech recognition on large-scale dataset[J]. arXiv:2010.11395, 2020.
[65] TRIPATHI A, KIM J, ZHANG Q, et al. Transformer transducer: one model unifying streaming and non-streaming speech recognition[J]. arXiv:2010.03192, 2020.
[66] ZHOU S, XU S, XU B. Multilingual end-to-end speech recognition with a single transformer on low-resource languages[J]. arXiv:1806.05059, 2018.
[67] LE H, PINO J, WANG C, et al. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation[C]//Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Dec 8-13, 2020: 3520-3533.