Journal of Frontiers of Computer Science and Technology

• Science Researches •

CFB: Financial Large Models Evaluation Methods

LI Yi,  LI Hao,  XU Xiaozhe,  YANG Yifan   

  1. School of Information, Shanxi University of Finance and Economics, Taiyuan 030000, China
  2. School of Statistics, Shanxi University of Finance and Economics, Taiyuan 030000, China
  3. College of Mechanical and Vehicular Engineering, Taiyuan University of Technology, Taiyuan 030000, China
  4. School of Information Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China

CFB:金融领域大模型评估方法

李毅, 李浩, 许骁哲, 杨一凡   

  1. 山西财经大学 信息学院, 太原 030000
  2. 山西财经大学 统计学院, 太原 030000
  3. 太原理工大学 机械与运载工程学院, 太原 030000
  4. 中南财经政法大学 信息工程学院, 武汉 430073

Abstract: As potential applications of large language models (LLMs) in the financial sector continue to emerge, evaluating the performance of financial LLMs is becoming increasingly important. However, current financial evaluation methods are limited by single-task evaluation, insufficient coverage of evaluation datasets, and contamination of benchmark data. Consequently, the potential of LLMs in the financial domain has not been fully explored. To address these issues, this paper proposes the Chinese Financial Benchmark (CFB) for evaluating financial LLMs. The CFB comprises 36 datasets, covers 24 financial tasks, and involves seven evaluation tasks: information extraction, text analysis, question answering, text generation, risk management, prediction, and decision-making. It also establishes corresponding benchmarks. The CFB's contributions are a broader range of tasks and data, an LLM-based benchmark decontamination method, and three evaluation methods: instruction fine-tuning, knowledge retrieval enhancement, and prompt engineering. An evaluation of 12 LLMs, including GPT-4o, ChatGPT, and Gemini, shows that while LLMs excel in information extraction and text analysis, they struggle with advanced reasoning and complex tasks. GPT-4o performs exceptionally well in information extraction and stock trading, whereas Gemini excels in text generation and prediction. Instruction fine-tuning improves LLMs' performance in text analysis but offers limited benefits for complex tasks.

Key words: Financial large models, evaluation benchmark, prompt engineering, retrieval-augmented generation, instruction fine-tuning

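Abstract (translated from the Chinese): As the application potential of large language models (LLMs) in the financial domain continues to emerge, evaluating the performance of financial large models has become especially important. However, current financial evaluation methods are limited by single-task evaluation, insufficient coverage of evaluation datasets, and contamination of benchmark data, so the potential of large models in finance has not been fully explored. To address this, this paper proposes CFB, an evaluation method for Chinese financial large models. CFB constructs 36 datasets covering 24 financial tasks and involving seven evaluation tasks for financial large models: multi-item question answering, term explanation, text generation, text translation, classification, move recognition, and prediction and decision-making. It also builds corresponding evaluation benchmarks. The new ideas proposed in CFB include a broader range of tasks and data, an LLM-based benchmark decontamination method, and evaluation based on three methods: instruction fine-tuning, knowledge retrieval enhancement, and prompt engineering. Twelve LLMs, including GPT-4o, ChatGPT, and Gemini, were evaluated. The experimental results show that although LLMs perform well in information extraction and text analysis, they have difficulty with advanced reasoning and complex tasks. GPT-4o stands out in information extraction and stock trading, whereas Gemini is stronger in text generation and prediction. Instruction-fine-tuned LLMs improve in text analysis, but the benefit for complex tasks is limited.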

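Key words (translated from the Chinese): financial large models, evaluation benchmark, prompts, knowledge retrieval enhancement, instruction fine-tuning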