CFB: Financial Large Models Evaluation Methods

doi:10.3778/j.issn.1673-9418.2406055

Abstract: As the potential applications of large language models (LLMs) in the financial sector continue to emerge, evaluating the performance of financial LLMs has become increasingly important. However, current financial evaluation methods are limited by narrow evaluation tasks, insufficient dataset coverage, and contamination of benchmark data, so the potential of LLMs in the financial domain has not been fully explored. To address these issues, this paper proposes the Chinese financial benchmark (CFB) for evaluating financial LLMs. CFB comprises 36 datasets covering 24 financial tasks, grouped into 7 evaluation tasks: question answering, terminology explanation, text generation, text translation, classification, move recognition, and predictive decision-making. It also establishes corresponding benchmarks. The contributions of CFB include a broader range of tasks and data, a benchmark decontamination method based on LLMs, and evaluation under three methods: instruction fine-tuning, knowledge retrieval enhancement, and prompt engineering. An evaluation of 12 LLMs, including GPT-4o, ChatGPT, and Gemini, shows that although LLMs excel in information extraction and text analysis, they struggle with advanced reasoning and complex tasks. GPT-4o performs particularly well in information extraction and stock trading, whereas Gemini is stronger in text generation and prediction. Instruction fine-tuning improves LLMs’ performance in text analysis but offers limited benefits for complex tasks.

Key words: financial large models, evaluation benchmark, prompt engineering, knowledge retrieval enhancement, instruction fine-tuning

LI Yi, LI Hao, XU Xiaozhe, YANG Yifan. CFB: Financial Large Models Evaluation Methods[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(12): 3272-3287.

李毅, 李浩, 许骁哲, 杨一凡. CFB：金融领域大模型评估方法[J]. 计算机科学与探索, 2024, 18(12): 3272-3287.
