Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the overestimation of large language models’ (LLMs) performance in tax law reasoning caused by training data contamination, which obscures their true reasoning capabilities. To mitigate this issue, the authors propose a contamination-aware evaluation framework that combines a contamination detection protocol, a mechanism for translating statutory texts into formal representations, a symbolic reasoning solver, and a test suite built from rule and case variations. The framework enables systematic assessment of the reliability and generalization of both end-to-end LLMs and neuro-symbolic hybrid systems. The experiments show that data contamination can inflate measured performance, while neuro-symbolic approaches are more robust and generalize better to unseen tax law scenarios, which the authors take as evidence that legal reasoning is inherently compositional.

📝 Abstract

Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.
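To make the evaluation idea concrete, here is a minimal sketch of how a formal rule, a symbolic solver, and case and rule variations could fit together. It is not the authors' implementation: the toy statute (a flat rate above an exemption), the names `Rule`, `Case`, `solve`, `perturbations`, `evaluate`, and `memorizing_model`, and all numbers are assumptions made purely for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rule:
    """Hypothetical formal form of a toy statute (not real tax law)."""
    exemption: int      # income below this amount is not taxed
    rate_percent: int   # flat rate applied to income above the exemption


@dataclass(frozen=True)
class Case:
    income: int


def solve(rule: Rule, case: Case) -> int:
    """Symbolic step: deterministic evaluation of the rule on the case facts."""
    taxable = max(0, case.income - rule.exemption)
    return taxable * rule.rate_percent // 100


def perturbations(rule: Rule, case: Case):
    """Yield the original problem plus one case variation and one rule variation."""
    yield "original", rule, case
    yield "case variation", rule, replace(case, income=case.income + 5000)
    yield "rule variation", replace(rule, exemption=rule.exemption + 2000), case


def evaluate(predict, rule: Rule, case: Case) -> dict:
    """Compare a predictor against the solver on every variant."""
    return {
        name: predict(r, c) == solve(r, c)
        for name, r, c in perturbations(rule, case)
    }


def memorizing_model(rule: Rule, case: Case) -> int:
    """Stand-in for a contaminated model that recalls one memorized answer."""
    return 3000


BASE_RULE = Rule(exemption=10000, rate_percent=10)
BASE_CASE = Case(income=40000)

if __name__ == "__main__":
    print(evaluate(memorizing_model, BASE_RULE, BASE_CASE))
    print(evaluate(solve, BASE_RULE, BASE_CASE))
```

In this toy setup the memorizing predictor matches the solver only on the original problem and fails on both variations, which is the kind of gap a perturbation-based test suite is meant to expose. A real system would replace the hand-written `Rule` with the output of an LLM translation step and `solve` with an actual symbolic solver.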

Problem

Research questions and friction points this paper is trying to address.

legal reasoning

data contamination

tax law

neuro-symbolic systems

LLM robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

contamination-aware evaluation

neuro-symbolic reasoning

legal AI