TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

📅 2026-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the inadequate performance of existing large language models in Chinese taxation—a highly specialized domain with stringent legal constraints—and the absence of a comprehensive, practice-oriented evaluation benchmark. To bridge this gap, we propose TaxPraBen, the first extensible Chinese tax evaluation benchmark, encompassing ten traditional tasks and three real-world scenarios: risk control, audit analysis, and tax planning. Constructed from 7.3K instances, TaxPraBen employs a structured evaluation paradigm combining structured parsing, field-aligned extraction, and numerical-textual matching, while integrating Bloom’s taxonomy for multidimensional capability assessment. Empirical evaluation of 19 mainstream models demonstrates the benchmark’s validity and discriminative power, revealing that closed-source models generally lead, Chinese-native models (e.g., Qwen2.5) outperform multilingual counterparts, and lightweight fine-tuning (e.g., YaYi2) yields limited gains.

📝 Abstract
While Large Language Models (LLMs) excel in many general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Although tax-related benchmarks are gaining attention, many focus on isolated NLP tasks and neglect real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm built as a "structured parsing, field-aligned extraction, numerical and textual matching" process, enabling end-to-end tax practice assessment while remaining extensible to other domains. We evaluate 19 LLMs based on Bloom's taxonomy. The results reveal significant performance disparities: closed-source large-parameter LLMs lead across the board, Chinese LLMs such as Qwen2.5 generally exceed multilingual LLMs, and YaYi2, fine-tuned on some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.
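The three-stage evaluation paradigm the abstract names can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's actual code: the function names (`parse_structured`, `align_fields`, `match_value`, `score_instance`), the JSON output format, and the numeric tolerance are all assumptions made for illustration.

```python
import json
import re


def parse_structured(output: str) -> dict:
    """Stage 1, structured parsing: turn raw model output into a dict.

    Falls back to grabbing the first {...} span when the model wraps
    its answer in prose (a common failure mode for LLM outputs).
    """
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", output, re.DOTALL)
        return json.loads(match.group(0)) if match else {}


def align_fields(pred: dict, gold: dict) -> dict:
    """Stage 2, field-aligned extraction: keep only gold-standard fields."""
    return {key: pred.get(key) for key in gold}


def match_value(pred, gold, tol: float = 1e-2) -> bool:
    """Stage 3, numerical-textual matching: numbers compare within a
    tolerance; anything non-numeric falls back to normalized text equality.
    """
    try:
        return abs(float(pred) - float(gold)) <= tol
    except (TypeError, ValueError):
        return str(pred).strip() == str(gold).strip()


def score_instance(raw_output: str, gold: dict) -> float:
    """End-to-end score for one instance: fraction of fields matched."""
    pred = align_fields(parse_structured(raw_output), gold)
    hits = sum(match_value(pred[key], gold[key]) for key in gold)
    return hits / len(gold) if gold else 0.0
```

Under this sketch, a model answer of `'{"tax_due": "1200.00", "taxpayer": "ACME"}'` against a gold record `{"tax_due": 1200, "taxpayer": "ACME"}` scores 1.0, since the numeric field matches within tolerance and the text field matches exactly.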
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Chinese Tax Domain
Real-World Tax Practice
Benchmark Evaluation
Structured Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured evaluation
real-world tax practice
scalable benchmark
Chinese LLMs
field alignment extraction
Gang Hu
Columbia University
Yating Chen
Yunnan University, Yunnan, China.
Haiyan Ding
Tsinghua University
Wang Gao
Jianghan University, Wuhan, China.
Jiajia Huang
Nanjing Audit University, Nanjing, China.
Min Peng
Wuhan University, Wuhan, China.
Qianqian Xie
Wuhan University
Kun Yu
Yunnan University, Yunnan, China.