Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (3): 731-739. DOI: 10.3778/j.issn.1673-9418.2211078

• Artificial Intelligence · Pattern Recognition •


Low-Resource Machine Translation Based on Training Strategy with Changing Gradient Weight

WANG Jiaqi, ZHU Junguo, YU Zhengtao   

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
    2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
  • Online: 2024-03-01 Published: 2024-03-01


Abstract: In recent years, neural network models such as the Transformer have achieved significant success in machine translation. However, training these models relies on abundant labeled data, which poses a challenge for low-resource machine translation due to the limited scale of parallel corpora. This limitation often leads to subpar performance and a susceptibility to overfitting on high-frequency vocabulary, reducing the model's generalization ability on the test set. To alleviate these issues, this paper proposes a gradient-weight modification strategy: the gradients computed for each new batch are multiplied by a coefficient before the Adam update. This coefficient increases incrementally, weakening the model's dependence on high-frequency features during early training while preserving the algorithm's rapid-convergence advantage in the later stages. The paper also describes the modified training procedure, including the adjustment and decay of the coefficient, so that different training stages emphasize different aspects. The goal of this strategy is to increase attention to low-frequency vocabulary and prevent the model from overfitting to high-frequency terms. Translation experiments are conducted on three low-resource bilingual datasets, and the proposed method improves over the baseline model by 0.72, 1.37, and 1.04 BLEU points on the respective test sets.
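The abstract describes the coefficient only qualitatively: it multiplies each new batch's gradients before the Adam update and grows as training proceeds. The following minimal PyTorch sketch illustrates that idea; the function name scaled_grad_step, the linear ramp, and the lambda_min floor are illustrative assumptions rather than the authors' exact schedule, and the decay phase mentioned in the abstract is omitted.

    import torch

    def scaled_grad_step(model, optimizer, loss, step, total_steps, lambda_min=0.1):
        # One training step with a gradient-weight coefficient applied on top
        # of Adam: damp early-training gradients, restore full updates later.
        optimizer.zero_grad()
        loss.backward()

        # Increasing coefficient: lambda_min at step 0, reaching 1.0 at
        # total_steps. The linear ramp is an assumption; the paper only
        # states that the coefficient increases during training.
        lam = lambda_min + (1.0 - lambda_min) * min(step / total_steps, 1.0)

        # Scale every parameter's gradient before Adam consumes it.
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p.grad.mul_(lam)

        optimizer.step()

In a training loop this would replace the plain optimizer.step() call, with the optimizer created as torch.optim.Adam(model.parameters(), ...) and step counted over batches.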

Key words: neural machine translation, overfitting, dynamic gradient weight