Journal of Frontiers of Computer Science and Technology

• Academic Research •

Sequence-to-Sequence Text Generation with Discrete Diffusion Models

JIANG Hang, CAI Guoyong, LI Sihui   

  1. College of Computer and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  2. Guangxi Key Lab of Trusted Software, Guilin, Guangxi 541004, China

Abstract: Diffusion language models are currently the most promising non-autoregressive language models and are expected to replace autoregressive language models, which suffer from slow inference, achieving efficient text generation without loss of quality. Sequence-to-sequence (Seq2Seq) generation tasks such as text summarization, machine translation, and dialogue generation are common practical application scenarios for diffusion language models, and achieving high-quality Seq2Seq text generation with low latency has long been a research focus in natural language processing. To this end, this paper simplifies the training process of the discrete diffusion model by deriving an upper bound on its training objective, and then introduces and adapts the mask-and-predict decoding strategy of the conditional masked language model as the inference algorithm of the diffusion model, improving generation quality. To further improve the quality of the text generated in the first few rounds of inference, this paper also proposes a sinusoidal noise schedule: compared with the original linear noise schedule, the high-noise interval of the time steps becomes larger, so the model focuses more on learning to recover data from heavily noised inputs, which are exactly what it encounters in the first few rounds of inference. Inspired by curriculum learning, this paper further designs a new sampling distribution over time steps, realizing an easy-to-hard learning strategy by manipulating which time steps are sampled. Experiments on public datasets show that the proposed methods effectively improve model performance; on the WMT16 EN-RO dataset, the diffusion model achieves generation quality comparable to the autoregressive baseline with only half the inference time.
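
To make the noise-schedule and time-step-sampling ideas concrete, the sketch below shows one plausible instantiation in Python. The abstract does not give the exact formulas, so the function names, the sine-based masking probability, and the linearly widening sampling range are all illustrative assumptions; they only show how a sinusoidal schedule enlarges the high-noise region and how an easy-to-hard curriculum can be realized by manipulating which time steps are sampled during training.

```python
import numpy as np

def linear_mask_prob(t, T):
    """Baseline linear schedule: the expected fraction of masked tokens
    grows in direct proportion to the time step t (1 <= t <= T)."""
    return t / T

def sinusoidal_mask_prob(t, T):
    """Assumed sinusoidal schedule: the masking probability rises quickly
    and then saturates, so a larger share of time steps falls in the
    high-noise region than under the linear schedule."""
    return np.sin(0.5 * np.pi * t / T)

def curriculum_timestep(step, total_steps, T, rng):
    """Assumed easy-to-hard sampler: early in training only small
    (lightly noised, easy) time steps are drawn; the admissible range
    grows towards the full horizon T as training progresses."""
    progress = step / total_steps                       # 0 -> 1 over training
    upper = max(1, int(round(T * (0.1 + 0.9 * progress))))
    return int(rng.integers(1, upper + 1))              # uniform over [1, upper]

if __name__ == "__main__":
    T = 1000
    rng = np.random.default_rng(0)
    # Halfway through the diffusion horizon the sinusoidal schedule already
    # masks about 71% of tokens, versus 50% under the linear schedule.
    print(linear_mask_prob(500, T), sinusoidal_mask_prob(500, T))
    # Time steps drawn at the start, middle, and end of training.
    for step in (0, 5000, 10000):
        print(step, [curriculum_timestep(step, 10000, T, rng) for _ in range(5)])
```

Under a schedule shaped like this, more of the training time steps correspond to heavily corrupted sequences, which is the regime the model faces during its first decoding rounds.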

Key words: diffusion model, language model, text generation, sequence to sequence, non-autoregressive model
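
For the inference side, the following is a minimal sketch of a mask-and-predict decoding loop in the spirit of the conditional masked language model mentioned in the abstract: all target positions start masked, the model fills them in, and the lowest-confidence positions are re-masked and re-predicted over a fixed number of rounds. The predict_fn interface, the MASK_ID constant, and the linear re-masking ratio are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

MASK_ID = 0  # hypothetical id of the [MASK] token in the target vocabulary

def mask_predict_decode(predict_fn, src, tgt_len, num_iters=10):
    """Minimal mask-and-predict decoding loop.

    predict_fn(src, tokens) is assumed to return, for every target position,
    a predicted token id and a confidence score (e.g. its probability).
    """
    tokens = np.full(tgt_len, MASK_ID, dtype=np.int64)   # start fully masked
    conf = np.zeros(tgt_len)
    for it in range(num_iters):
        pred, pred_conf = predict_fn(src, tokens)
        # Only positions that are currently masked take the new predictions.
        masked = tokens == MASK_ID
        tokens = np.where(masked, pred, tokens)
        conf = np.where(masked, pred_conf, conf)
        # Linearly decay how many tokens get re-masked for the next round.
        n_mask = int(tgt_len * (num_iters - 1 - it) / num_iters)
        if n_mask == 0:
            break
        # Re-mask the lowest-confidence positions and predict them again.
        remask = np.argsort(conf)[:n_mask]
        tokens[remask] = MASK_ID
        conf[remask] = 0.0
    return tokens

# Toy usage with a stand-in "model" that guesses random tokens.
def dummy_predict(src, tokens):
    rng = np.random.default_rng(len(tokens))
    return rng.integers(1, 1000, size=tokens.shape), rng.random(tokens.shape)

print(mask_predict_decode(dummy_predict, src=None, tgt_len=8, num_iters=4))
```

Because the number of rounds can be much smaller than the target length, decoding latency does not grow token by token, which is consistent with the reported result that the diffusion model matches the autoregressive baseline's quality in half the inference time.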