[1] OpenAI. GPT-4 technical report[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2303.08774.
[2] BUBECK S, CHANDRASEKARAN V, ELDAN R, et al. Sparks of artificial general intelligence: early experiments with GPT-4[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2303.12712.
[3] DAI Y, FENG D, HUANG J, et al. LAiW: a Chinese legal large language models benchmark[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2310.05620.
[4] LEI Y, LI J, CHENG D. CFBenchmark: Chinese financial assistant benchmark for large language model[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2311.05812.
[5] FENG D, DAI Y, HUANG J. Empowering many, biasing a few: generalist credit scoring through large language models[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2310.00566.
[6] XIE Q, HAN W, ZHANG X, et al. PIXIU: a large language model, instruction data and evaluation benchmark for finance[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2306.05443.
[7] LU D, WU H, LIANG J, et al. BBT-Fin: comprehensive construction of Chinese financial domain pre-trained language model, corpus and benchmark[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2302.09432.
[8] ZHANG L, CAI W, LIU Z, et al. FinEval: a Chinese financial domain knowledge evaluation benchmark for large language models[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2308.09975.
[9] YANG S, CHIANG W L, ZHENG L, et al. Rethinking benchmark and contamination for language models with rephrased samples[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2311.04850.
[10] ZHANG H, DA J, LEE D, et al. A careful examination of large language model performance on grade school arithmetic[EB/OL]. [2024-07-23]. https://arxiv.org/abs/2405.00332.
[11] YANG H, LIU X Y, WANG C D. FinGPT: open-source financial large language models[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2306.06031.
[12] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: open and efficient foundation language models[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2302.13971.
[13] YU Y M. Cornucopia-LLaMA-Fin-Chinese[EB/OL]. (2023-05-10)[2024-04-16]. https://github.com/jerry1993-tech/Cornucopia-LLaMA-Fin-Chinese.
[14] CHEN W, WANG Q, LONG Z, et al. DISC-FinLLM: a Chinese financial large language model based on multiple experts fine-tuning[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2310.15205.
[15] WU S J, IRSOY O, LU S, et al. BloombergGPT: a large language model for finance[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2303.17564.
[16] ZHANG X, YANG Q, XU D. XuanYuan 2.0: a large Chinese financial chat model with hundreds of billions parameters[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2305.12002.
[17] 申丽萍, 何朝帆, 曹东旭, 等. 大语言模型在中学历史学科中的应用测评分析[J]. 现代教育技术, 2024, 34(2): 62-71.
SHEN L P, HE C F, CAO D X, et al. Evaluation and analysis of the application of large language models in secondary school history[J]. Modern Educational Technology, 2024, 34(2): 62-71.
[18] PANG C X, CAO Y X, YANG C H, et al. Uncovering limitations of large language models in information seeking from tables[EB/OL]. [2024-07-23]. https://arxiv.org/abs/2406.04113.
[19] CHEN M, TWOREK J, JUN H, et al. Evaluating large language models trained on code[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2107.03374.
[20] 罗世杰. 金融大模型: 应用、风险与制度应对[J]. 金融发展研究, 2024, 9(6): 70-78.
LUO S J. Financial large models: applications, risks, and institutional responses[J]. Financial Development Research, 2024, 9(6): 70-78.
[21] 许志伟, 李海龙, 李博, 等. AIGC大模型测评综述:使能技术、安全隐患和应对[J]. 计算机科学与探索, 2024, 18(9): 2293-2325.
XU Z W, LI H L, LI B, et al. Survey of AIGC large model evaluation: enabling technologies, vulnerabilities and mitigation[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(9): 2293-2325.
[22] ZHAO H Q, LIU Z L, WU Z, et al. Revolutionizing finance with LLMs: an overview of applications and insights[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2401.11641.
[23] ADLAKHA V, BEHNAMGHADER P, LU X H, et al. Evaluating correctness and faithfulness of instruction-following models for question answering[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2307.16877.
[24] 邱冬阳, 蓝宇. ChatGPT给金融行业带来的机遇、挑战及问题[J]. 西南金融, 2023(6): 18-29.
QIU D Y, LAN Y. Opportunities, challenges, and issues brought to the financial industry by ChatGPT[J]. Southwest Finance, 2023(6): 18-29.
[25] MA X T, DOUSTI M J, WANG C H, et al. SimulEval: an evaluation toolkit for simultaneous translation[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2007.16193.
[26] BULLA L, GANGEMI A, MONGIOVI M. Do language models understand morality? Towards a robust detection of moral content[EB/OL]. [2024-07-23]. https://arxiv.org/abs/2406.04143.
[27] SHAH R S, CHAWLA K, EIDNANI D, et al. When FLUE meets FLANG: benchmarks and large pre-trained language model for financial domain[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2211.00083.
[28] SHAH A, PATURI S, CHAVA S. Trillion dollar words: a new financial dataset, task & market analysis[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2305.07972.
[29] SINHA A, KHANDAIT T. Impact of news on the commodity market: dataset and results[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2009.04202.
[30] JUNGWIRTH G, SAHA A, SCHRÖDER M, et al. Connecting the dotfiles: checked-in secret exposure with extra (lateral movement) steps[C]//Proceedings of the 2023 IEEE/ACM 20th International Conference on Mining Software Repositories. Piscataway: IEEE, 2023: 322-333.
[31] CHEN Z Y, CHEN W H, SMILEY C, et al. FinQA: a dataset of numerical reasoning over financial data[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2109.00122.
[32] ZENG H, XUE J, HAO M, et al. Evaluating the generation capabilities of large Chinese language models[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2308.04823.
[33] LI Y C, GUO Y H, GUERIN F, et al. Evaluating large language models for generalization and robustness via data compression[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2402.00861.
[34] PASSALIS N, KANNIAINEN J, GABBOUJ M. Forecasting financial time series using robust deep adaptive input normalization[J]. Journal of Signal Processing Systems, 2021, 93(10): 1235-1251.
[35] HENDRYCKS D, BURNS C, BASART S, et al. Measuring massive multitask language understanding[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2009.03300.
[36] XU L, LI A, ZHU L, et al. SuperCLUE: a comprehensive Chinese large language model benchmark[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2307.15020.
[37] HUANG Y Z, BAI Y Z, ZHU Z H, et al. C-Eval: a multi-level multi-discipline Chinese evaluation suite for foundation models[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2305.08322.
[38] ANIL R, BORGEAUD S, ALAYRAC J B, et al. Gemini: a family of highly capable multimodal models[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2312.11805.
[39] CHU Y, XU J, YANG Q, et al. Qwen2-Audio technical report[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2407.10759.
[40] 赵志枭, 胡蝶, 刘畅, 等. 人文社科领域中文通用大模型性能评测[J]. 图书情报工作, 2024, 68(13): 132-143.
ZHAO Z X, HU D, LIU C, et al. Performance evaluation of Chinese general-purpose large models in the humanities and social sciences[J]. Library and Information Service, 2024, 68(13): 132-143.
[41] YU Y Y, LI H H, CHEN Z, et al. FinMem: a performance-enhanced LLM trading agent with layered memory and character design[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2311.13743.
[42] DESHPANDE A, MURAHARI V, RAJPUROHIT T, et al. Toxicity in ChatGPT: analyzing persona-assigned language models[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2304.05335.
[43] XIE Q Q, HAN W G, LAI Y Z, et al. The wall street neophyte: a zero-shot analysis of ChatGPT over multimodal stock movement prediction challenges[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2304.05351.
[44] lm-sys/llm-decontaminator[EB/OL]. (2023-07-18) [2024-04-16]. https://github.com/lm-sys/llm-decontaminator.
[45] HU G, QIN K, YUAN C H. No language is an island: unifying Chinese and English in financial large language models, instruction data, and benchmarks[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2403.06249.
[46] CLAVIE B, CICEU A, NAYLOR F, et al. Large language models in the workplace: a case study on prompt engineering for job type classification[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2303.07142.
[47] DEROY A, GHOSH K, GHOSH S. How ready are pre-trained abstractive models and LLMs for legal case judgement summarization?[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2306.01248.
[48] BANG Y, CAHYAWIJAYA S, LEE N, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2302.04023.
[49] Proceedings of the Third Conference on Machine Translation: Research Papers[EB/OL]. (2018-08-18) [2024-04-16]. https://aclanthology.org/W18-63.
[50] PILLUTLA K, SWAYAMDIPTA S, ZELLERS R, et al. MAUVE: measuring the gap between neural text and human text using divergence frontiers[EB/OL]. [2024-04-16]. https://arxiv.org/abs/2102.01454.
[51] RAWAL A, WANG H, ZHEN Y J, et al. SMLT-MUGC: small, medium, and large texts-machine versus user-generated content detection and comparison[EB/OL]. [2024-07-23]. https://arxiv.org/abs/2407.12815.
[52] ARIEL R A. A monthly effect in stock returns[J]. Journal of Financial Economics, 1987, 18(1): 161-174.
[53] ZHOU X Z, ZHOU H, LONG H G. Forecasting the equity premium: do deep neural network models work?[EB/OL].(2023-07-23)[2024-04-16]. https://api.semanticscholar.org/CorpusID:261543571.
[54] Streetwise: the best of the Journal of Portfolio Management[EB/OL]. (1998-02-10)[2024-07-23]. https://api.semanticscholar.org/CorpusID:152726169.
[55] REN Y Y, YE H R, FANG H J, et al. ValueBench: towards comprehensively evaluating value orientations and understanding of large language models[EB/OL]. [2024-07-23]. https://arxiv.org/abs/2406.04214.