Journal of Frontiers of Computer Science and Technology, 2021, Vol. 15, Issue (12): 2256-2275. DOI: 10.3778/j.issn.1673-9418.2106105

Surveys and Frontiers

Review of Extracting Methods for Lip Visual Features

MA Jinlin, GONG Yuanwen, MA Ziping, CHEN Deguang, ZHU Yanbin, LIU Yuhao   

1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
2. Key Laboratory for Intelligent Processing of Computer Images and Graphics of National Ethnic Affairs Commission of the PRC, Yinchuan 750021, China
3. School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China
Online: 2021-12-01    Published: 2021-12-09

Abstract:

Current research on lip reading focuses mainly on improving recognition accuracy and studying multimodal input features, while little attention has been paid to improving the effectiveness of lip visual features. Yet lip visual information plays a key role in visual speech recognition and lip reading, and it becomes especially important when the audio is corrupted or unavailable. Obtaining accurate and effective lip visual features remains one of the main difficulties in lip reading. This paper reviews recent research on lip reading from three aspects: lip-reading datasets, traditional visual feature extraction methods, and deep learning methods for visual feature extraction. First, it summarizes the datasets used for lip reading, divides them into frontal-view and multi-view types, and compiles the characteristics, limitations, and download addresses of both types. Second, it introduces traditional lip visual feature extraction methods from the perspectives of pixel-based, shape-based, and hybrid features, focusing on the basic idea, network structure, and characteristics of each method. Third, it introduces deep learning methods for lip visual feature extraction, focusing on the network structures, advantages, and disadvantages of four categories: 2D CNNs, 3D CNNs, combinations of 2D and 3D CNNs, and other neural networks, and compares the performance of these methods on public datasets. Finally, it discusses the challenges faced by lip visual feature extraction methods and outlines future research trends.

Key words: lip recognition, visual feature, deep learning
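The abstract distinguishes 2D CNN and 3D CNN front-ends for lip visual feature extraction. As a minimal sketch of that distinction only (not the architecture of any surveyed model), the NumPy code below applies a 2D convolution to each frame of a lip-region clip independently, and a 3D convolution that also spans neighbouring frames. The function names `conv2d_per_frame` and `conv3d`, the toy clip, and the kernels are hypothetical choices made for illustration.

```python
import numpy as np

# Illustrative sketch only: hypothetical helpers, not taken from the surveyed papers.
# Both use "valid" cross-correlation, as CNN layers do, with a single channel.

def conv2d_per_frame(clip, kernel):
    """2D front-end: convolve each frame of clip (T, H, W) independently.

    Returns (T, H-kh+1, W-kw+1); output frame t depends only on input frame t.
    """
    T, H, W = clip.shape
    kh, kw = kernel.shape
    out = np.zeros((T, H - kh + 1, W - kw + 1))
    for t in range(T):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[t, i, j] = np.sum(clip[t, i:i + kh, j:j + kw] * kernel)
    return out

def conv3d(clip, kernel):
    """3D front-end: convolve over time and space with kernel (kt, kh, kw).

    Returns (T-kt+1, H-kh+1, W-kw+1); each output mixes kt consecutive frames,
    so it can respond to lip motion between frames.
    """
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(T - kt + 1):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[t, i, j] = np.sum(clip[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

if __name__ == "__main__":
    # Toy 5-frame, 8x8 "lip region": static, except a dark patch in frame 2
    # (a hypothetical stand-in for the mouth opening).
    static = np.ones((5, 8, 8))
    moving = static.copy()
    moving[2, 3:5, 3:5] = 0.0

    smooth_2d = np.ones((3, 3)) / 9.0                  # spatial averaging kernel
    temporal_diff = np.array([-1.0, 1.0]).reshape(2, 1, 1)  # frame t+1 minus frame t

    print(conv2d_per_frame(moving, smooth_2d).shape)   # (5, 6, 6)
    print(conv3d(moving, temporal_diff).shape)         # (4, 8, 8)
    # The temporal kernel is zero everywhere on the static clip and non-zero
    # only where the patch changes between frames.
    print(np.abs(conv3d(static, temporal_diff)).max())  # 0.0
```

The point of the comparison is the receptive field: the 2D version can only describe the lip shape within one frame, so temporal modelling must be added by a later stage, whereas the 3D kernel captures short-range lip motion directly, at the cost of more computation per layer.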