🤖 AI Summary
This study addresses the inadequate performance of existing large language models in Chinese taxation—a highly specialized domain with stringent legal constraints—and the absence of a comprehensive, practice-oriented evaluation benchmark. To bridge this gap, we propose TaxPraBen, the first extensible Chinese tax evaluation benchmark, encompassing ten traditional tasks and three real-world scenarios: risk control, audit analysis, and tax planning. Constructed from 7.3K instances, TaxPraBen employs a structured evaluation paradigm combining structured parsing, field-aligned extraction, and numerical-textual matching, while integrating Bloom's taxonomy for multidimensional capability assessment. Empirical evaluation of 19 mainstream models demonstrates the benchmark's validity and discriminative power, revealing that closed-source models generally lead, Chinese-native models (e.g., Qwen2.5) outperform multilingual counterparts, and lightly fine-tuned tax models (e.g., YaYi2) yield only limited gains.
📝 Abstract
While Large Language Models (LLMs) excel in many general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Moreover, although tax-related benchmarks are gaining attention, many focus on isolated NLP tasks and neglect real-world practical capabilities. To address this gap, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It comprises 10 traditional application tasks and 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm built on a "structured parsing - field-aligned extraction - numerical and textual matching" process, enabling end-to-end tax practice assessment while remaining extensible to other domains. We evaluate 19 LLMs based on Bloom's taxonomy. The results reveal significant performance disparities: closed-source large-parameter LLMs lead overall; Chinese LLMs such as Qwen2.5 generally exceed multilingual LLMs; and YaYi2, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.
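To make the "structured parsing - field-aligned extraction - numerical and textual matching" paradigm concrete, here is a minimal sketch of how such a scorer could work. All function names, field names, and the tolerance value are illustrative assumptions, not the paper's actual implementation.

```python
import json
import re

def parse_structured(output: str) -> dict:
    """Structured parsing: pull the first JSON object out of free-form model text.

    Hypothetical step 1 of the paradigm; the paper's real parser is not specified here.
    """
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if not match:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}

def match_field(pred, gold, rel_tol: float = 1e-3) -> bool:
    """Numerical matching with a relative tolerance for amounts; exact matching for text."""
    try:
        p, g = float(pred), float(gold)
        return abs(p - g) <= rel_tol * max(abs(g), 1.0)
    except (TypeError, ValueError):
        return str(pred).strip() == str(gold).strip()

def score(output: str, gold: dict) -> float:
    """Field-aligned extraction and matching: fraction of gold fields the model got right."""
    pred = parse_structured(output)
    hits = sum(match_field(pred.get(k), v) for k, v in gold.items())
    return hits / len(gold)

# Example with assumed field names ("tax_type", "amount"):
raw = 'The payable VAT is shown below. {"tax_type": "VAT", "amount": 1300.0}'
gold = {"tax_type": "VAT", "amount": 1300}
print(score(raw, gold))  # → 1.0
```

Keying the score on gold-side fields means extra fields in the model's output are ignored, while a missing or mismatched field counts against it, which is one plausible way to make structured outputs comparable across models.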