Evaluating Accounting Reasoning Capabilities of Large Language Models

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the challenge of evaluating the reasoning capabilities of large language models (LLMs) in accounting. It defines vertical-domain accounting reasoning and proposes evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. Using this framework and varying the prompt design, the authors evaluate GLM-6B, GLM-130B, GLM-4, and GPT-4 on accounting reasoning tasks. Prompt design significantly affects performance, and GPT-4 shows the strongest capability overall. Even so, none of the evaluated models is yet sufficient for real-world enterprise accounting. The work provides evaluation criteria and benchmark results for studying and improving accounting reasoning in LLMs.

📝 Abstract
Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite these results, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.

Problem

Research questions and friction points this paper is trying to address.

accounting reasoning
large language models
enterprise accounting
model evaluation
professional domain integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

accounting reasoning
large language models
evaluation framework
prompt design
domain-specific reasoning