Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (9): 1578-1594.DOI: 10.3778/j.issn.1673-9418.2103020

• Surveys and Frontiers •

Research Status and Prospect of Transformer in Speech Recognition

ZHANG Xiaoxu, MA Zhiqiang, LIU Zhiqiang, ZHU Fangyuan, WANG Chunyu   

  1. College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, China
  2. Inner Mongolia Autonomous Region Engineering & Technology Research Centre of Big Data Based Software Service, Hohhot 010080, China
  • Online: 2021-09-01  Published: 2021-09-06


Abstract:

As a new deep learning framework, the Transformer has attracted growing attention from researchers and has become a current research hotspot. Inspired by the way humans focus only on important things, the self-attention mechanism in the Transformer model learns mainly the important information in the input sequence. For speech recognition tasks, the goal is to transcribe an input speech sequence into the corresponding text. The traditional approach combines an acoustic model, a pronunciation dictionary, and a language model into a speech recognition system, whereas the Transformer can integrate them into a single neural network, forming an end-to-end speech recognition system that avoids the forced alignment and multi-module training required by traditional systems. It is therefore necessary to examine the problems the Transformer faces in speech recognition tasks. This paper first introduces the structure of the Transformer model, and then analyzes the problems confronting speech recognition from three aspects: the input speech sequence, the deep model architecture, and model inference. Next, the methods proposed to address the obstacles in these three aspects are outlined and summarized. Finally, future applications and directions of the Transformer in speech recognition are discussed.
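The self-attention mechanism mentioned in the abstract can be sketched minimally as scaled dot-product attention, the standard formulation in the Transformer literature. The sketch below is illustrative only: it uses identity projections (Q = K = V = X) instead of learned projection matrices, and the toy "speech" sequence and its dimensions are invented for the example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence X of d-dim vectors.
    For simplicity Q = K = V = X (no learned projections)."""
    d = len(X[0])
    out = []
    for q in X:
        # Similarity of this frame's query to every frame's key,
        # scaled by sqrt(d) as in the original Transformer.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Toy 3-frame "speech" sequence of 2-dim feature vectors (hypothetical data).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)
```

Because each output frame is a convex combination of all input frames, frames carrying important information can dominate every position's output, which is the intuition behind "learning the important information in the input sequence."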

Key words: Transformer, deep learning, end-to-end, speech recognition
