Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (3): 731-739. DOI: 10.3778/j.issn.1673-9418.2211078

• Artificial Intelligence·Pattern Recognition •

Low-Resource Machine Translation Based on Training Strategy with Changing Gradient Weight

WANG Jiaqi, ZHU Junguo, YU Zhengtao   

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
  • Online: 2024-03-01  Published: 2024-03-01


Abstract: In recent years, neural network models such as Transformer have achieved remarkable success in machine translation. Training these models, however, relies on abundant labeled data, which poses a challenge for low-resource machine translation: the limited scale of parallel corpora often leads to poor performance and a susceptibility to overfitting on high-frequency vocabulary, reducing the model's generalization ability on the test set. To alleviate these issues, this paper proposes a gradient weight modification strategy: on top of the Adam algorithm, the gradients produced by each new batch are multiplied by a coefficient. This coefficient increases incrementally, weakening the model's dependence on high-frequency features during early training while preserving the algorithm's rapid convergence in the later stages. The paper also describes the modified training procedure, including how the coefficient is adjusted and decayed so that different training stages emphasize different aspects of the data. The goal of this strategy is to increase attention to low-frequency vocabulary and prevent the model from overfitting to high-frequency terms. Translation experiments on three low-resource bilingual datasets show that the proposed method improves on the baseline model by 0.72, 1.37, and 1.04 BLEU points on the respective test sets.
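The abstract's core idea, multiplying each new batch's gradients by an incrementally increasing coefficient before the Adam update, can be sketched as follows. This is a minimal PyTorch illustration, not the paper's implementation: the linear warm-up schedule, the bounds c_min and c_max, and the helper name scaled_adam_step are assumptions, since the abstract does not specify the exact coefficient schedule or its decay phase.

```python
import torch

def scaled_adam_step(model, optimizer, loss, step, total_steps,
                     c_min=0.1, c_max=1.0):
    """One training step in which the fresh batch gradient is scaled by a
    coefficient that grows over training before Adam consumes it.
    The linear schedule below is an assumption; the paper's exact
    schedule (and its later decay) is not given in the abstract."""
    optimizer.zero_grad()
    loss.backward()
    # Coefficient rises from c_min toward c_max: early batches contribute
    # damped updates (weakening reliance on high-frequency features), while
    # later batches recover Adam's full step size for fast convergence.
    coeff = c_min + (c_max - c_min) * min(step / total_steps, 1.0)
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(coeff)  # scale the raw gradient before optimizer.step()
    optimizer.step()
    return coeff

# Hypothetical usage with a standard optimizer:
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# for step, batch in enumerate(loader):
#     loss = model(batch)  # assumes the model returns a scalar loss
#     scaled_adam_step(model, optimizer, loss, step, total_steps=len(loader))
```

Because the gradient is scaled before optimizer.step(), Adam's first- and second-moment estimates are also built from the damped gradients, which is one plausible reading of "multiplying the gradients generated for each new batch by a coefficient on top of the Adam algorithm"; scaling the final parameter update instead would be an equally valid variant.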

Key words: neural machine translation, overfitting, dynamic gradient weight
