Journal of Frontiers of Computer Science and Technology

• Academic Research •


A stable-output method for retrieval-augmented large language models in private question-answering systems

LI Boxin   

  1. Xiaomi AI Lab, Beijing 100085, China
  2. Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

Abstract: Question-answering systems built on large language models (LLMs) suffer from the semantic inconsistency of LLMs, which manifests as unstable outputs; this instability undermines the safety, robustness, and credibility of a question-answering system and severely degrades the user experience. To address this issue, this paper proposes a stable-output method for retrieval-augmented LLMs in private question-answering systems. The method optimizes the prompt so that the LLM first outputs num_k synonymous rewrites of the user's query and only then outputs the final answer; when generating the answer, the LLM can condition on the num_k synonymous queries it has already produced, which makes its outputs more stable. To handle problems caused by the weak instruction-following ability of open-source LLMs, such as an unstable number of generated synonymous queries and output formats that cannot be parsed, this paper uses data distillation with a closed-source LLM to automatically construct an open-domain retrieval-augmented instruction dataset, on which an open-source LLM is then fine-tuned. In addition, an evaluation set for a private question-answering scenario is built to validate the effectiveness of the method. Experimental results on this evaluation set show that the proposed method significantly outperforms the baseline on both consistency and performance metrics. In particular, compared with the baseline, the consistency metrics ROUGE-1, ROUGE-2, ROUGE-L, and BLEU improve by 18.9, 30.1, 24.5, and 30.6 respectively, and accuracy improves by 17.4%.
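A minimal sketch of the prompt structure the abstract describes (rewrites first, answer second). The template wording, the JSON output format, the NUM_K value, and both helper functions are illustrative assumptions of this sketch, not the paper's actual prompt:

```python
import json

NUM_K = 3  # illustrative value; num_k is a tunable parameter of the method

# Assumed template: the abstract specifies the structure (synonymous rewrites
# first, answer second) but not the exact wording or output format.
PROMPT_TEMPLATE = """You are answering a question over the retrieved passages below.

Passages:
{passages}

User query: {query}

First, write {num_k} synonymous rewrites of the user query.
Then, referring to those rewrites, answer the query.
Reply strictly as JSON: {{"synonymous_queries": [...], "answer": "..."}}"""


def build_prompt(query: str, passages: list[str], num_k: int = NUM_K) -> str:
    """Fill the retrieval-augmented prompt with the query and retrieved passages."""
    joined = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return PROMPT_TEMPLATE.format(passages=joined, query=query, num_k=num_k)


def parse_response(raw: str, num_k: int = NUM_K) -> dict:
    """Parse the model reply; reject it when the format or rewrite count is wrong.

    These two checks mirror the failure modes the paper fine-tunes away:
    an unstable number of synonymous queries and unparseable output formats.
    """
    data = json.loads(raw)  # raises JSONDecodeError (a ValueError) on bad format
    if len(data.get("synonymous_queries", [])) != num_k:
        raise ValueError("unexpected number of synonymous queries")
    if "answer" not in data:
        raise ValueError("missing answer field")
    return data
```

Generating the rewrites before the answer forces the answer tokens to be conditioned on a paraphrase context of the query, which is the mechanism the abstract credits for the more stable outputs.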

Key words: large language models, retrieval-augmented generation, stability of large language models, question-answering systems
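The consistency metrics above compare outputs across repeated runs of the same query. A sketch under the assumption of average pairwise ROUGE-L F1; the rouge-score package and the aggregation scheme are choices of this sketch, not specified by the abstract:

```python
from itertools import combinations

from rouge_score import rouge_scorer  # pip install rouge-score


def consistency_rouge_l(outputs: list[str]) -> float:
    """Average pairwise ROUGE-L F1 over repeated outputs for one query.

    Higher means repeated runs agree more, i.e. the system is more stable.
    Pairwise averaging is an assumption of this sketch; the abstract does
    not state the exact aggregation scheme.
    """
    assert len(outputs) >= 2, "need at least two runs to measure consistency"
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [scorer.score(a, b)["rougeL"].fmeasure
              for a, b in combinations(outputs, 2)]
    return sum(scores) / len(scores)


# Example: three answers from repeated runs of the same user query.
runs = [
    "The warranty covers repairs for two years.",
    "Repairs are covered by a two-year warranty.",
    "The product warranty lasts two years and covers repairs.",
]
print(f"ROUGE-L consistency: {consistency_rouge_l(runs):.3f}")
```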