🤖 AI Summary
This study addresses the inadequate performance of existing large language models in Chinese taxation—a highly specialized domain with stringent legal constraints—and the absence of a comprehensive, practice-oriented evaluation benchmark. To bridge this gap, we propose TaxPraBen, the first extensible Chinese tax evaluation benchmark, encompassing ten traditional tasks and three real-world scenarios: risk control, audit analysis, and tax planning. Constructed from 7.3K instances, TaxPraBen employs a structured evaluation paradigm combining structured parsing, field-aligned extraction, and numerical-textual matching, while integrating Bloom's taxonomy for multidimensional capability assessment. Empirical evaluation of 19 mainstream models demonstrates the benchmark's validity and discriminative power, revealing that closed-source models generally lead, Chinese-native models (e.g., Qwen2.5) outperform multilingual counterparts, and lightly fine-tuned tax models (e.g., YaYi2) yield only limited gains.
📝 Abstract
While Large Language Models (LLMs) excel in many general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Moreover, although tax-related benchmarks are gaining attention, many focus on isolated NLP tasks and neglect real-world practical capabilities. To address this gap, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It comprises 10 traditional application tasks and 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm built on a "structured parsing - field-aligned extraction - numerical and textual matching" process, enabling end-to-end tax practice assessment while remaining extensible to other domains. We evaluate 19 LLMs based on Bloom's taxonomy. The results reveal significant performance disparities: closed-source large-parameter LLMs lead overall; Chinese LLMs such as Qwen2.5 generally exceed multilingual LLMs; and YaYi2, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.
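To make the "structured parsing - field-aligned extraction - numerical and textual matching" paradigm concrete, here is a minimal sketch of how such a scorer could work. All function names, field names, and the tolerance value are illustrative assumptions, not the paper's actual implementation.

```python
import json
import re

def parse_structured(output: str) -> dict:
    """Structured parsing: pull the first JSON object out of free-form model text.

    Hypothetical step 1 of the paradigm; the paper's real parser is not specified here.
    """
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if not match:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}

def match_field(pred, gold, rel_tol: float = 1e-3) -> bool:
    """Numerical matching with a relative tolerance for amounts; exact matching for text."""
    try:
        p, g = float(pred), float(gold)
        return abs(p - g) <= rel_tol * max(abs(g), 1.0)
    except (TypeError, ValueError):
        return str(pred).strip() == str(gold).strip()

def score(output: str, gold: dict) -> float:
    """Field-aligned extraction and matching: fraction of gold fields the model got right."""
    pred = parse_structured(output)
    hits = sum(match_field(pred.get(k), v) for k, v in gold.items())
    return hits / len(gold)

# Example with assumed field names ("tax_type", "amount"):
raw = 'The payable VAT is shown below. {"tax_type": "VAT", "amount": 1300.0}'
gold = {"tax_type": "VAT", "amount": 1300}
print(score(raw, gold))  # → 1.0
```

Keying the score on gold-side fields means extra fields in the model's output are ignored, while a missing or mismatched field counts against it, which is one plausible way to make structured outputs comparable across models.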