FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
While large language models (LLMs) excel at understanding unstructured text, their ability to reason over structured financial audit documents, which adhere to GAAP/XBRL standards and feature hierarchical organization and semantic constraints, has not been systematically evaluated. Method: We introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for financial auditing, comprising three subtasks: semantic matching, relation extraction, and numeric consistency. Built from real XBRL filings, it integrates accounting principles and document hierarchy into a unified retrieval-classification-reasoning evaluation framework, employing zero-shot prompting and fine-grained metrics. Contribution/Results: Evaluating 13 state-of-the-art LLMs reveals accuracy drops of up to 60-90% on hierarchical multi-document reasoning tasks, exposing fundamental limitations in structured financial reasoning. FinAuditing thus provides a rigorous, domain-grounded testbed to diagnose and advance LLM capabilities for audit-relevant structured inference.

📝 Abstract
The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.
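To make the numerical-consistency subtask (FinMR) concrete: in XBRL, a calculation relationship asserts that a parent fact should equal the weighted sum of its child facts, with weights of +1 or -1 defined by the taxonomy. The sketch below is illustrative only; the concept names, values, and helper function are assumptions for this example, not artifacts from the FinAuditing benchmark itself.

```python
def check_calculation(facts, parent, children, tolerance=0.5):
    """Return True if the parent fact equals the weighted sum of its children.

    facts: dict mapping concept name -> reported numeric value
    children: list of (concept name, weight) pairs from a calculation relationship
    tolerance: allowed rounding difference between reported and computed values
    """
    total = sum(weight * facts[child] for child, weight in children)
    return abs(facts[parent] - total) <= tolerance

# Illustrative facts, loosely modeled on a balance-sheet roll-up.
facts = {
    "Assets": 1000.0,
    "AssetsCurrent": 400.0,
    "AssetsNoncurrent": 600.0,
}

# Assets = (+1) * AssetsCurrent + (+1) * AssetsNoncurrent
consistent = check_calculation(
    facts, "Assets", [("AssetsCurrent", 1), ("AssetsNoncurrent", 1)]
)
print(consistent)  # → True
```

A benchmark task of this kind asks the model to perform the same check from the filing text and taxonomy alone, without an explicit calculator, which is where the reported accuracy drops appear.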
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' reasoning over structured financial documents
Assessing semantic, relational, and numerical consistency in auditing
Addressing limitations in taxonomy-grounded financial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed taxonomy-structured multi-document financial auditing benchmark
Proposed unified framework integrating retrieval, classification, and reasoning metrics
Evaluated LLMs on semantic, relational, and numerical consistency subtasks