🤖 AI Summary
This study addresses the challenge of evaluating the reasoning capabilities of large language models (LLMs) in the accounting domain by proposing the first systematic benchmarking framework tailored to this vertical domain. The framework accounts for characteristics of the models' training data and introduces quantifiable, domain-specific tasks and evaluation criteria. Using carefully designed prompt engineering, the authors conduct a comprehensive assessment of GLM-6B, GLM-130B, GLM-4, and GPT-4. Results indicate that prompt formulation significantly influences model performance, with GPT-4 performing best overall. Nevertheless, none of the evaluated models meets the reliability standards required for enterprise-level accounting applications. This work establishes a methodological foundation and an empirical benchmark for assessing LLM capabilities in specialized professional domains.
📝 Abstract
Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical-domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite this progress, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.
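To make the claim that prompt formulation affects accounting-reasoning performance concrete, the sketch below shows one minimal way such an evaluation could be structured: the same accounting item is posed under several prompt templates and the answers are exact-matched. This is an illustrative sketch only; the question, the prompt templates, and the `ask_model` stand-in are hypothetical and are not the benchmark, prompts, or models used in the paper.

```python
# Minimal, self-contained prompt-variation harness (illustrative only).
from typing import Callable, Dict

# One toy accounting item: straight-line depreciation.
ITEM = {
    "question": ("A machine costs 120,000, has a salvage value of 20,000, "
                 "and a useful life of 5 years. What is the annual "
                 "straight-line depreciation?"),
    "answer": "20000",
}

# Different formulations of the same question.
PROMPTS: Dict[str, str] = {
    "direct": "{question}\nAnswer with a number only.",
    "role": ("You are a certified accountant. {question}\n"
             "Answer with a number only."),
    "cot": ("{question}\nThink step by step, then give the final number "
            "on the last line."),
}

def score(ask_model: Callable[[str], str]) -> Dict[str, bool]:
    """Run each prompt variant through `ask_model` and exact-match the answer."""
    results = {}
    for name, template in PROMPTS.items():
        reply = ask_model(template.format(question=ITEM["question"]))
        # Keep only the digits on the last line before comparing.
        last_line = reply.strip().splitlines()[-1]
        digits = "".join(ch for ch in last_line if ch.isdigit())
        results[name] = digits == ITEM["answer"]
    return results

if __name__ == "__main__":
    # Stand-in model: (120,000 - 20,000) / 5 = 20,000 per year.
    dummy = lambda prompt: "The annual depreciation is\n20000"
    print(score(dummy))  # {'direct': True, 'role': True, 'cot': True}
```

In a real study, `ask_model` would wrap calls to the models under test, and accuracy would be aggregated over many items per prompt variant rather than a single example.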