Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models

📅 2025-07-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Automated verification of financial report paragraphs against accounting standards remains a critical challenge for AI auditing systems. Method: This study evaluates large language models (LLMs) for compliance verification, comparing open- and closed-source models, including Llama-2 (70B), GPT-3.5, and GPT-4, on two domain-specific datasets provided by PwC Germany. Contribution/Results: Llama-2 70B outperforms all closed-source models in detecting non-compliant cases (true negatives), highlighting the potential of open-weight LLMs for specialized regulatory tasks. GPT-4, however, performs best across a broad variety of scenarios, particularly in non-English contexts. The study supports the feasibility of LLMs for regulatory compliance auditing and offers evidence to guide model selection in financial auditing applications.

📝 Abstract

The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70-billion-parameter model demonstrates outstanding performance in detecting non-compliance, or true negative occurrences, beating all its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.
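
To make the "true negative" measure concrete: if "compliant" is treated as the positive class, a true negative is a non-compliant paragraph that the model correctly labels as non-compliant. The Python sketch below computes the true negative rate on made-up labels. The function name, the boolean label encoding, and the example data are illustrative assumptions and are not taken from the paper or the PwC datasets.

```python
from typing import Sequence


def true_negative_rate(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Share of non-compliant items (gold is False) that are predicted non-compliant."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Keep only the predictions for items whose gold label is non-compliant.
    negatives = [p for g, p in zip(gold, predicted) if not g]
    if not negatives:
        raise ValueError("gold labels contain no non-compliant examples")
    true_negatives = sum(1 for p in negatives if not p)
    return true_negatives / len(negatives)


# Hypothetical labels: True = compliant, False = non-compliant.
gold = [True, False, False, True, False, False]
predicted = [True, False, True, True, False, False]

tnr = true_negative_rate(gold, predicted)
print(f"True negative rate: {tnr:.2f}")
```

Here three of the four non-compliant paragraphs are caught, so the rate is 0.75. A model can score well on this measure while scoring worse overall, which is consistent with the abstract reporting different winners for non-compliance detection and for broad performance.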

Problem

Research questions and friction points this paper is trying to address.

Automating compliance verification in financial auditing using LLMs

Comparing open-source and proprietary LLMs for regulatory compliance

Evaluating LLM performance in detecting non-compliance across languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Large Language Models for compliance verification

Compares open-source and proprietary LLM configurations

Leverages custom datasets from PwC Germany
