计算机科学与探索

• 学术研究 •

基于大语言模型的NLP数据增强方法综述

许德龙, 林民, 王玉荣, 张树钧   

  1. 内蒙古师范大学 计算机科学技术学院,呼和浩特 010022
    2. 内蒙古师范大学 数学科学学院,呼和浩特 010022
    3. 内蒙古师范大学 文学院,呼和浩特 010022

A Survey of NLP Data Augmentation Methods Based on Large Language Models

XU Delong, LIN Min, WANG Yurong, ZHANG Shujun   

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
    2. College of Mathematics Sciences, Inner Mongolia Normal University, Hohhot 010022, China
    3. College of Literature, Inner Mongolia Normal University, Hohhot 010022, China

摘要: 当前,大语言模型在自然语言处理(NLP)领域展现出巨大的潜力,但其训练过程依赖于大量高质量样本。在低资源场景下,随着模型规模不断扩大,现有数据样本数量难以支撑模型训练收敛,这一问题激发了相关领域科研工作者对于数据增强方法的研究。然而,传统数据增强方法在NLP领域大模型背景下存在应用范围有限和数据失真的问题。相比之下,基于大语言模型的数据增强方法能够更有效地应对这一挑战。本文从综合性的视角全面探讨了现阶段NLP领域基于大语言模型的数据增强方法。首先,对NLP领域传统数据增强方法进行分析与总结。接着,将现阶段NLP领域多种大语言模型数据增强方法归纳总结,并深入探讨了每一种方法的适用范围、优点以及局限性。随后,介绍了NLP领域数据增强评估方法。最后,通过对当前方法的对比实验和结果分析讨论了NLP领域大语言模型数据增强方法的未来研究方向,并提出了前瞻性建议。本文的目标是为NLP领域大语言模型工作者提供关键见解,最终促进该领域大语言模型的发展。

关键词: 数据增强方法, 大语言模型, 自然语言处理, 深度学习, 人工智能

Abstract: Currently, large language models show great potential in the field of natural language processing (NLP), but their training process relies on a large number of high-quality samples. In low-resource scenarios, as model sizes keep increasing, the number of available data samples can hardly support training to convergence, and this problem has inspired researchers in related fields to investigate data augmentation methods. However, in the context of large models in NLP, traditional data augmentation methods suffer from a limited scope of application and data distortion. In contrast, data augmentation methods based on large language models can address this challenge more effectively. This paper explores current data augmentation methods based on large language models in the NLP domain from a comprehensive perspective. Firstly, traditional data augmentation methods in the NLP domain are analysed and summarised. Then, the various large language model data augmentation methods currently used in the NLP domain are categorised, and the scope of application, advantages, and limitations of each method are discussed in depth. Subsequently, evaluation methods for data augmentation in the field of NLP are introduced. Finally, future research directions of data augmentation methods based on large language models in the NLP domain are discussed through comparative experiments and analyses of current methods, and prospective suggestions are made. The goal of this paper is to provide key insights to those working on large language models in the NLP domain, and ultimately to promote the development of large language models in the field.

Key words: data augmentation, large language models, natural language processing, deep learning, artificial intelligence
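The contrast the abstract draws between traditional and LLM-based augmentation can be sketched minimally as follows. This is an illustrative sketch, not code from the surveyed work: `random_swap` stands in for edit-based traditional methods (which are cheap but risk the data distortion noted above), and `llm_paraphrase` stands in for LLM-based methods, with `generate` as a placeholder for any chat/completion API call.

```python
import random

def random_swap(tokens, n_swaps=1, seed=0):
    """Traditional edit-based augmentation: swap random token pairs.
    Cheap and label-agnostic, but may distort the sentence meaning --
    the 'data distortion' problem of traditional methods."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def llm_paraphrase(sentence, generate):
    """LLM-based augmentation: ask a generator to paraphrase while
    preserving the label. `generate` is a hypothetical stand-in for
    any model API call (prompt in, text out)."""
    prompt = f"Paraphrase this sentence, keeping its meaning and label: {sentence}"
    return generate(prompt)
```

In practice, `generate` would wrap a real model endpoint; the LLM variant preserves fluency and meaning far better than token edits, at the cost of inference calls.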