🤖 AI Summary
This study addresses the challenge of evaluating the reasoning capabilities of large language models (LLMs) in the accounting domain by proposing the first systematic benchmarking framework tailored to this vertical domain. The framework accounts for characteristics of the models' training data and introduces quantifiable, domain-specific tasks and evaluation criteria. Using carefully designed prompt engineering, the authors conduct a comprehensive assessment of GLM-6B, GLM-130B, GLM-4, and GPT-4. Results indicate that prompt formulation significantly influences model performance, with GPT-4 performing best overall. Nevertheless, none of the evaluated models meets the reliability standards required for enterprise-level accounting applications. This work establishes a methodological foundation and an empirical benchmark for assessing LLM capabilities in specialized professional domains.
📝 Abstract
Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical-domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite this progress, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.
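To make the claim that prompt formulation affects accounting-reasoning performance concrete, the sketch below shows one minimal way such an evaluation could be structured: the same accounting item is posed under several prompt templates and the answers are exact-matched. This is an illustrative sketch only; the question, the prompt templates, and the `ask_model` stand-in are hypothetical and are not the benchmark, prompts, or models used in the paper.

```python
# Minimal, self-contained prompt-variation harness (illustrative only).
from typing import Callable, Dict

# One toy accounting item: straight-line depreciation.
ITEM = {
    "question": ("A machine costs 120,000, has a salvage value of 20,000, "
                 "and a useful life of 5 years. What is the annual "
                 "straight-line depreciation?"),
    "answer": "20000",
}

# Different formulations of the same question.
PROMPTS: Dict[str, str] = {
    "direct": "{question}\nAnswer with a number only.",
    "role": ("You are a certified accountant. {question}\n"
             "Answer with a number only."),
    "cot": ("{question}\nThink step by step, then give the final number "
            "on the last line."),
}

def score(ask_model: Callable[[str], str]) -> Dict[str, bool]:
    """Run each prompt variant through `ask_model` and exact-match the answer."""
    results = {}
    for name, template in PROMPTS.items():
        reply = ask_model(template.format(question=ITEM["question"]))
        # Keep only the digits on the last line before comparing.
        last_line = reply.strip().splitlines()[-1]
        digits = "".join(ch for ch in last_line if ch.isdigit())
        results[name] = digits == ITEM["answer"]
    return results

if __name__ == "__main__":
    # Stand-in model: (120,000 - 20,000) / 5 = 20,000 per year.
    dummy = lambda prompt: "The annual depreciation is\n20000"
    print(score(dummy))  # {'direct': True, 'role': True, 'cot': True}
```

In a real study, `ask_model` would wrap calls to the models under test, and accuracy would be aggregated over many items per prompt variant rather than a single example.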