M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing financial evaluation benchmarks rely predominantly on static textual sources (e.g., financial reports, news), failing to capture the dynamic, interactive nature of real-world financial meetings, and lacking a unified multilingual, multi-industry, and multi-task assessment framework. Method: We introduce M$^3$FinMeeting, the first benchmark explicitly designed for authentic financial meeting scenarios, covering English, Chinese, and Japanese; all 11 GICS sectors; and three core tasks: meeting summarization, QA-pair extraction, and question answering. It extends evaluation from static documents to dynamic meeting dialogues, achieving orthogonal coverage across language, industry, and task. Annotation combines human verification with rule-based enhancement to ensure quality and reproducibility. Results: Experiments on seven mainstream LLMs show that the best-performing model achieves only a 62.3% average F1 across tasks, exposing critical limitations in long-context comprehension. M$^3$FinMeeting thus serves as both a new standard and a diagnostic tool for evaluating LLM capabilities in finance.

📝 Abstract
Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called M$^3$FinMeeting, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, M$^3$FinMeeting supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, M$^3$FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M$^3$FinMeeting as a benchmark for assessing LLMs' financial meeting comprehension skills.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs in multilingual financial meeting contexts
Addresses the gap between static-document benchmarks and real meeting dynamics
Assesses multi-task understanding via summarization and QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual dataset for financial meeting understanding
Covers diverse sectors via GICS classification
Includes summarization, QA extraction, answering tasks
👥 Authors
Jie Zhu
School of Computer Science and Technology, Soochow University
Junhui Li
School of Computer Science and Technology, Soochow University
Yalong Wen
Qwen DianJin Team, Alibaba Cloud Computing
Xiandong Li
Nanjing University
Lifan Guo
Researcher, Drexel University
Feng Chen
Qwen DianJin Team, Alibaba Cloud Computing