Evaluating Accounting Reasoning Capabilities of Large Language Models

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of evaluating reasoning capabilities of large language models (LLMs) in the accounting domain by proposing the first systematic benchmarking framework tailored to this vertical field. The framework incorporates characteristics of model training data and introduces quantifiable, domain-specific tasks and evaluation criteria. Through carefully designed prompt engineering, the authors conduct a comprehensive assessment of GLM-6B, GLM-130B, GLM-4, and GPT-4. Results indicate that prompt formulation significantly influences model performance, with GPT-4 demonstrating overall superiority. Nevertheless, all evaluated models fall short of meeting the reliability standards required for enterprise-level accounting applications. This work establishes a methodological foundation and empirical benchmark for assessing LLM capabilities in specialized professional domains.

📝 Abstract
Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite these gains, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.
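The paper's central empirical finding is that prompt formulation materially changes measured accounting-reasoning performance. The authors' actual tasks, prompts, and scoring rules are not reproduced here, so the following is only a minimal sketch of how such a prompt-sensitivity comparison could be harnessed: two hypothetical prompt styles are applied to the same accounting question, answers are scored by exact match on the gold label, and per-style accuracy is reported. The `stub_model` stands in for a real GLM or GPT-4 API call.

```python
# Hypothetical sketch of a prompt-sensitivity evaluation harness; the
# prompt templates, scoring rule, and stub model are illustrative
# assumptions, not the paper's actual protocol.

def build_prompts(question: str) -> dict:
    """Two prompt formulations for the same accounting reasoning task."""
    return {
        "zero_shot": f"Question: {question}\nAnswer:",
        "role_cot": (
            "You are a certified accountant. Reason step by step, "
            f"then state the final answer.\nQuestion: {question}\nAnswer:"
        ),
    }

def score(model_answer: str, gold: str) -> int:
    """Score 1 if the gold answer string appears in the model output."""
    return int(gold.lower() in model_answer.lower())

def evaluate(model, dataset) -> dict:
    """Accuracy per prompt style over (question, gold) pairs."""
    totals = {}
    for question, gold in dataset:
        for style, prompt in build_prompts(question).items():
            totals.setdefault(style, []).append(score(model(prompt), gold))
    return {style: sum(s) / len(s) for style, s in totals.items()}

# Stub standing in for a GLM/GPT-4 call: "answers" only the role prompt.
def stub_model(prompt: str) -> str:
    return "Debit Cash 5,000" if "accountant" in prompt else "Unsure"

dataset = [
    ("A firm receives 5,000 in cash for services. Which account is debited?",
     "Cash"),
]
results = evaluate(stub_model, dataset)
```

In a real study the stub would be replaced by API calls to each evaluated model, and the accuracy gap between styles quantifies the prompt-design effect the paper reports.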
Problem

Research questions and friction points this paper is trying to address.

accounting reasoning
large language models
enterprise accounting
model evaluation
professional domain integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

accounting reasoning
large language models
evaluation framework
prompt design
domain-specific reasoning
Jie Zhou
School of Computer Engineering, Jiangsu Ocean University
Xin Chen
Principal investigator, Shanghai Jiao Tong University School of Medicine
Jie Zhang
Unknown affiliation
Hai Li
School of Computer Engineering, Jiangsu Ocean University
Jie Wang
Professor of Computer Science, University of Massachusetts Lowell
Zhe Li
School of Computer Engineering, Jiangsu Ocean University