🤖 AI Summary
Existing financial evaluation benchmarks rely predominantly on static textual sources (e.g., financial reports, news), failing to capture the dynamic, interactive nature of real-world financial meetings and lacking a unified, multilingual, multi-industry, and multi-task assessment framework.
Method: We introduce M³FinMeeting—the first benchmark explicitly designed for authentic financial meeting scenarios—covering English, Chinese, and Japanese; all 11 GICS industries; and three core tasks: meeting summarization, QA pair extraction, and question answering. We extend evaluation from static documents to dynamic meeting dialogues, achieving orthogonal coverage across language, industry, and task. Annotation leverages human verification and rule-based enhancement to ensure high quality and reproducibility.
Results: Experiments on seven mainstream LLMs reveal that the best-performing model achieves only a 62.3% average F1 across tasks, exposing critical limitations in long-context comprehension. M³FinMeeting establishes a novel standard and diagnostic tool for evaluating LLM capabilities in finance.
📝 Abstract
Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called $\texttt{M$^3$FinMeeting}$, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, $\texttt{M$^3$FinMeeting}$ supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, $\texttt{M$^3$FinMeeting}$ includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of $\texttt{M$^3$FinMeeting}$ as a benchmark for assessing LLMs' financial meeting comprehension skills.