UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing financial LLM evaluation benchmarks fail to reflect user satisfaction and practical utility in real-world financial scenarios. To address this, we propose UCFE, the first user-centric financial LLM evaluation benchmark. Built on authentic financial intent and interaction data from 804 users, UCFE employs an LLM-as-Judge dynamic assessment paradigm to evaluate 11 mainstream models on complex financial tasks, including financial planning and risk identification. In contrast to conventional static, purely expert-driven evaluation, UCFE adopts a hybrid framework that integrates human expert judgment and explicit user feedback with dynamic task interaction. Experimental results show strong alignment between UCFE scores and human preferences (Pearson *r* = 0.78), exposing capability gaps, particularly in domain-specific reasoning and actionable advice generation, and pointing to concrete optimization directions for financial LLMs.
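
The judge protocol itself is not spelled out on this page, so the following is only a minimal sketch of how an LLM-as-Judge pairwise comparison over user-centric financial tasks could look; the prompt template, verdict parsing, and win-rate aggregation are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal LLM-as-Judge sketch (assumptions: a generic `judge(prompt) -> str`
# callable stands in for the judge model; prompt wording and the win-rate
# aggregation are hypothetical, not the UCFE paper's exact protocol).
from typing import Callable, Dict, List

JUDGE_TEMPLATE = """You are evaluating two assistants on a financial task.
User intent: {intent}
Conversation context: {context}

Response A:
{response_a}

Response B:
{response_b}

Which response better satisfies the user? Answer with exactly "A", "B", or "Tie"."""


def pairwise_judge(judge: Callable[[str], str],
                   task: Dict[str, str],
                   response_a: str,
                   response_b: str) -> str:
    """Ask the judge model which of two responses better serves the user."""
    prompt = JUDGE_TEMPLATE.format(
        intent=task["intent"],
        context=task["context"],
        response_a=response_a,
        response_b=response_b,
    )
    verdict = judge(prompt).strip()
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"


def win_rate(verdicts: List[str]) -> float:
    """Win rate of model A across pairwise verdicts; ties count as half a win."""
    if not verdicts:
        return 0.0
    score = sum(1.0 if v == "A" else 0.5 if v == "Tie" else 0.0 for v in verdicts)
    return score / len(verdicts)
```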

📝 Abstract
This paper introduces UCFE: User-Centric Financial Expertise, an innovative benchmark designed to evaluate the ability of large language models (LLMs) to handle complex, real-world financial tasks. The UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. First, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Second, based on this feedback, we created a dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 11 LLM services using the LLM-as-Judge methodology. Our results show significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. The UCFE benchmark not only reveals the potential of LLMs in the financial domain but also provides a robust framework for assessing their performance and user satisfaction.
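
The reported agreement with human preferences is a plain Pearson correlation between per-model benchmark scores and human preference scores. A minimal sketch of that check, with illustrative numbers rather than the paper's actual data, might look like:

```python
# Alignment check sketch: correlate per-model UCFE scores with human preference
# scores. The numbers below are made-up placeholders, not the paper's results.
from scipy.stats import pearsonr

ucfe_scores = [0.71, 0.64, 0.58, 0.80, 0.47]      # hypothetical benchmark scores, one per model
human_prefs = [0.68, 0.60, 0.55, 0.83, 0.50]      # hypothetical aggregated human ratings

r, p_value = pearsonr(ucfe_scores, human_prefs)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")  # the paper reports r = 0.78
```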
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs' proficiency on real-world financial tasks.
Simulate complex, evolving financial scenarios during evaluation.
Assess user satisfaction with LLM-generated financial assistance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid evaluation approach
LLM-as-Judge methodology
User-centric dataset creation (a hypothetical record layout is sketched below)
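
To make the user-centric dataset idea concrete, here is a hypothetical sketch of a single record pairing a user intent with a multi-turn interaction; the field names are illustrative assumptions, not the released UCFE schema.

```python
# Hypothetical record layout for a user-centric benchmark example
# (field names are illustrative; the actual UCFE schema may differ).
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str   # one message in the evolving financial scenario


@dataclass
class UCFERecord:
    task_type: str                          # e.g. "financial planning", "risk identification"
    user_intent: str                        # what the user ultimately wants to accomplish
    interaction: List[Turn] = field(default_factory=list)


example = UCFERecord(
    task_type="financial planning",
    user_intent="Build a retirement savings plan on a fixed monthly budget.",
    interaction=[
        Turn("user", "I can set aside 800 USD a month. How should I allocate it?"),
        Turn("assistant", "A common starting point is to split it between index funds and bonds..."),
    ],
)
```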
👥 Authors
Yuzhe Yang (The Chinese University of Hong Kong, Shenzhen)
Yifei Zhang (Nanjing University)
Yan Hu (The Chinese University of Hong Kong, Shenzhen)
Yilin Guo (The Chinese University of Hong Kong, Shenzhen)
Ruoli Gan (The Chinese University of Hong Kong, Shenzhen)
Yueru He (Columbia University): Finance, Large Language Models
Mingcong Lei (The Chinese University of Hong Kong, Shenzhen): AI, Agent, Embodied, Deep Learning
Xiao Zhang (The Fin AI)
Haining Wang (Nanjing University)
Qianqian Xie (Wuhan University): NLP, LLM
Jimin Huang (The Fin AI): computational finance
Honghai Yu (Nanjing University)
Benyou Wang (Assistant Professor, The Chinese University of Hong Kong, Shenzhen): large language models, natural language processing, information retrieval, applied machine learning