Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (1): 132-140. DOI: 10.3778/j.issn.1673-9418.2406060

• Constructions and Applications of Large Language Models •

Method of Retrieval-Augmented Large Language Models with Stable Outputs for Private Question-Answering Systems

LI Boxin   

  1. Xiaomi AI Lab, Beijing 100085, China
  2. Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  • Online: 2025-01-01  Published: 2024-12-31


Abstract: Question-answering systems built on large language models (LLMs) suffer from the semantic inconsistency of LLMs, which manifests as unstable outputs: semantically equivalent queries can yield different answers. This instability undermines the safety, robustness, and credibility of a question-answering system and severely degrades the user experience. To address this issue, this paper proposes a retrieval-augmented method that stabilizes LLM outputs for private question-answering systems. The method optimizes the prompt so that the LLM first outputs num_k synonymous variants of the user's query and only then outputs the final answer; because the answer is generated after, and conditioned on, the num_k synonymous queries, the LLM's output becomes more stable. Open-source LLMs, whose instruction-following ability is comparatively weak, tend to generate an unstable number of synonymous queries and output formats that cannot be parsed. To tackle these issues, this paper uses data distillation with a closed-source LLM to automatically construct an open-domain retrieval-augmented instruction dataset, and then fine-tunes an open-source LLM on this dataset. In addition, an evaluation dataset is built for a private question-answering scenario to validate the effectiveness of the proposed method. Experimental results on this evaluation dataset show that the proposed method significantly outperforms the baseline on both consistency and performance metrics: the consistency metrics ROUGE-1, ROUGE-2, ROUGE-L, and BLEU improve by 18.9, 30.1, 24.5, and 30.6 percentage points respectively, and the performance metric accuracy improves by 17.4 percentage points.
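The abstract describes a two-step prompt: the model first emits num_k synonymous rewrites of the user query and only then the final answer, conditioned on those rewrites. Below is a minimal Python sketch of that idea; the prompt wording, the Q1:/Answer: output labels, and the build_prompt and parse_output helpers are illustrative assumptions, since the paper's exact template and parser are not given in the abstract.

    import re

    NUM_K = 3  # assumed value; the paper treats num_k as a tunable parameter

    # Hypothetical prompt template: synonymous rewrites first, answer last.
    PROMPT_TEMPLATE = (
        "Reference passages retrieved from the private knowledge base:\n"
        "{passages}\n\n"
        "User query: {query}\n\n"
        "Step 1: Rewrite the user query as {num_k} synonymous queries, "
        "one per line, labelled Q1: through Q{num_k}:.\n"
        "Step 2: On a final line beginning with 'Answer:', answer the user "
        "query using only the passages and the rewrites above.\n"
    )

    def build_prompt(query, passages, num_k=NUM_K):
        # Number the retrieved passages so the model can cite them implicitly.
        joined = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        return PROMPT_TEMPLATE.format(passages=joined, query=query, num_k=num_k)

    def parse_output(text, num_k=NUM_K):
        """Return (synonymous_queries, answer), or raise ValueError when the
        model breaks the requested format -- the instability the paper's
        instruction fine-tuning is meant to remove."""
        queries = re.findall(r"^Q\d+:\s*(.+)$", text, flags=re.MULTILINE)
        answer = re.search(r"^Answer:\s*(.+)\Z", text.strip(),
                           flags=re.MULTILINE | re.DOTALL)
        if len(queries) != num_k or answer is None:
            raise ValueError("output does not follow the two-step format")
        return queries, answer.group(1).strip()

A parse failure on a raw open-source model's output corresponds to the "unstable number of synonymous queries, unparseable format" failure mode the abstract mentions; fine-tuning on the distilled instruction dataset is what makes parse_output succeed reliably.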

Key words: large language models, retrieval-augmented generation, stability of large language models, question-answering systems
