Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (7): 1207-1219. DOI: 10.3778/j.issn.1673-9418.2012062

• Survey · Exploration •

Survey on Sequence Data Augmentation

GE Yizhou, XU Xiang, YANG Suorong, ZHOU Qing, SHEN Furao

  1. Science and Technology on Communication Information Security Control Laboratory, Jiaxing, Zhejiang 314033, China
  2. No.36 Research Institute, China Electronics Technology Group Corporation, Jiaxing, Zhejiang 314033, China
  3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Online: 2021-07-01 Published: 2021-07-09

Abstract:

In pursuit of higher accuracy, deep learning models are becoming increasingly complex and their networks increasingly deep. A larger number of parameters means that more data are needed to train a model. However, manually labeling data is costly, and for practical reasons data in some specific domains can be hard to obtain, so data insufficiency is a very common problem. Data augmentation alleviates this problem by artificially generating new data to enlarge the dataset. Its success in computer vision has led researchers to ask whether similar methods can be applied to sequence data. This survey describes not only time-domain methods such as flipping and cropping but also augmentation methods in the frequency domain. In addition to methods designed from experience or domain knowledge, it discusses in detail a series of GAN-based methods that use machine learning models to generate data automatically. It introduces data augmentation methods applied to several kinds of sequence data, including natural language text, audio signals, and time series, and touches on their performance in problems such as medical diagnosis and emotion classification. Although the data types differ, the augmentation methods applied to them share similar design ideas. Using these ideas as a thread, the survey organizes the data augmentation methods applied to the various types of sequence data and closes with some discussion and outlook.

Key words: sequence data, data augmentation, deep learning
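
As a rough illustration of the time-domain and frequency-domain families named in the abstract, the sketch below applies flipping, random cropping, and a simple spectral perturbation to a one-dimensional numeric series. It is not taken from the survey: the function names (`flip`, `random_crop`, `amplitude_jitter`), the `scale` parameter, and the choice of amplitude jitter as the frequency-domain example are all assumptions made for illustration. It also assumes the label of a sample survives each transform, which does not hold for every task.

```python
import cmath
import random


def flip(series):
    """Time-domain flipping: reverse the order of the samples."""
    return series[::-1]


def random_crop(series, length, rng):
    """Time-domain cropping: keep a random contiguous window of `length` samples."""
    start = rng.randrange(len(series) - length + 1)
    return series[start:start + length]


def dft(series):
    """Naive discrete Fourier transform (O(n^2)), adequate for short examples."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(spectrum)) / n
        for t in range(n)
    ]


def amplitude_jitter(series, scale, rng):
    """Frequency-domain augmentation (illustrative): rescale the amplitude of
    each non-DC frequency bin by a random factor in [1 - scale, 1 + scale]."""
    n = len(series)
    spectrum = dft(series)
    for k in range(1, n // 2 + 1):
        factor = 1.0 + rng.uniform(-scale, scale)
        spectrum[k] *= factor
        if k != n - k:
            # Scale the mirrored bin identically so the spectrum stays
            # conjugate-symmetric and the reconstructed series stays real.
            spectrum[n - k] *= factor
    return [value.real for value in idft(spectrum)]


if __name__ == "__main__":
    rng = random.Random(0)
    series = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
    print(flip(series))
    print(random_crop(series, 4, rng))
    print(amplitude_jitter(series, 0.1, rng))
```

Because the DC bin is left untouched, the jittered series keeps the mean of the original. In practice a library FFT would replace the naive transform, which is used here only to keep the sketch dependency-free.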