🤖 AI Summary
Existing financial large language models (FinLLMs) lack rigorous evaluation frameworks for complex regulatory rule comprehension and compliance reasoning. Method: We introduce the Regulations Challenge at COLING 2025, the first multi-task benchmark dedicated to financial regulatory understanding, comprising nine novel tasks that cover core compliance scenarios such as clause provenance tracing and conflict detection. The challenge employs zero-shot and few-shot evaluation across mainstream models (e.g., Llama, Qwen, Phi) and ensures data quality through expert validation. Contribution/Results: Aggregated results from 23 participating teams show that state-of-the-art FinLLMs achieve an average accuracy below 62% on critical tasks, exposing severe limitations in regulatory semantic understanding and compliance reasoning. This work provides the first systematic characterization of FinLLMs' capability boundaries in regulatory interpretation, offering a foundational benchmark and actionable insights for developing trustworthy financial AI systems.
📝 Abstract
Financial large language models (FinLLMs) have been applied to a variety of tasks in business, finance, accounting, and auditing. Complex financial regulations and standards, with which LLMs must comply, are critical to financial services. However, FinLLMs' performance in understanding and interpreting financial regulations has rarely been studied. We therefore organize the Regulations Challenge, a shared task at COLING 2025, to encourage the academic community to explore the strengths and limitations of popular LLMs. We create nine novel tasks and corresponding question sets. In this paper, we provide an overview of these tasks and summarize participants' approaches and results. We aim to raise awareness of FinLLMs' professional capability in financial regulations.