FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking

📅 2025-04-22
🤖 AI Summary
Financial natural language inference (NLI) faces severe domain-transfer challenges: general-purpose models degrade substantially on heterogeneous financial texts, including SEC filings, annual reports, and earnings call transcripts. To address this, we introduce FinNLI, the first NLI benchmark spanning diverse financial text sources, comprising 21,304 samples with a rigorously curated, expert-annotated test set of 3,304 instances. Our methodology integrates collaborative expert annotation, multi-source alignment sampling, and spurious-correlation suppression to systematically mitigate superficial statistical biases. Empirical results show that state-of-the-art pre-trained language models (PLMs) and large language models (LLMs) achieve only 74.57% and 78.62% zero-shot macro F1, respectively; notably, instruction tuning fails to improve performance and often degrades it, exposing fundamental deficits in financial logical reasoning. FinNLI thus establishes a robust, realistic benchmark for evaluating and advancing model generalization in practical financial inference tasks.

📝 Abstract
We introduce FinNLI, a benchmark dataset for financial natural language inference (NLI) across diverse financial texts such as SEC filings, annual reports, and earnings call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest macro F1 scores for pre-trained language model (PLM) and large language model (LLM) baselines are 74.57% and 78.62%, respectively, highlighting the dataset's difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs' financial reasoning, indicating room for improvement.
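The macro F1 scores reported above weight each of the three NLI labels (entailment, neutral, contradiction) equally, regardless of class frequency. A minimal sketch of the metric, using hypothetical toy labels rather than the paper's data:

```python
LABELS = ["entailment", "neutral", "contradiction"]

def macro_f1(gold, pred):
    """Macro F1: unweighted mean of per-class F1 over the three NLI labels."""
    f1s = []
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical gold and predicted labels for four premise-hypothesis pairs.
gold = ["entailment", "neutral", "contradiction", "entailment"]
pred = ["entailment", "contradiction", "contradiction", "neutral"]
print(f"{macro_f1(gold, pred):.4f}")  # → 0.4444
```

Because every class contributes equally to the average, a model that ignores a rare label is penalized more under macro F1 than under accuracy.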
Problem

Research questions and friction points this paper is trying to address.

The lack of an NLI benchmark covering diverse financial text genres (SEC filings, annual reports, earnings call transcripts)
Domain shift that degrades general-domain NLI model performance on financial text
Weak financial reasoning in current LLMs, including instruction-tuned financial models
Innovation

Methods, ideas, or system contributions that make the work stand out.

FinNLI, a multi-genre benchmark dataset for financial NLI with an expert-annotated test set
A construction framework ensuring diverse premise-hypothesis pairs while minimizing spurious correlations
Systematic evaluation of domain-shift impact on PLM and LLM baselines
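An NLI instance pairs a premise with a hypothesis and a three-way label; a FinNLI-style record might look like the following (illustrative only, with hypothetical field names and text not drawn from the actual dataset):

```python
# Illustrative financial NLI record; fields and text are hypothetical,
# not taken from FinNLI itself.
example = {
    "premise": "Total revenue for fiscal 2023 increased 12% to $4.2 billion, "
               "driven primarily by growth in the services segment.",
    "hypothesis": "The company's revenue declined in fiscal 2023.",
    "label": "contradiction",  # one of: entailment, neutral, contradiction
    "genre": "annual_report",  # e.g. SEC filing, annual report, earnings call
}

assert example["label"] in {"entailment", "neutral", "contradiction"}
print(example["label"])  # → contradiction
```

Varying the genre field across text sources is what distinguishes a multi-genre benchmark like FinNLI from single-source NLI datasets.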