🤖 AI Summary
Automated verification of financial report paragraphs against accounting standards remains a critical challenge for AI auditing systems.
Method: This study proposes a multilingual compliance verification framework leveraging large language models (LLMs), systematically evaluating open- and closed-source models, including Llama-2 (70B), GPT-3.5, and GPT-4, on two custom, domain-specific datasets provided by PwC Germany.
Contribution/Results: Empirical results show that Llama-2 70B outperforms all evaluated closed-source models in detecting non-compliant cases (true negatives), highlighting the potential of open-source LLMs for specialized regulatory tasks. Conversely, GPT-4 performs best across a broad variety of scenarios, particularly in non-English contexts. The study supports the feasibility of LLMs for regulatory compliance auditing and provides empirical evidence to inform model selection in financial auditing applications.
📝 Abstract
The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying whether the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70-billion-parameter model demonstrates outstanding performance in detecting non-compliance, i.e., true negative occurrences, beating all of its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform best in a broad variety of scenarios, particularly in non-English contexts.