🤖 AI Summary
This study addresses the limited robustness of large language models (LLMs) when confronted with semantically and structurally perturbed tabular data, revealing their inability to autonomously detect and correct errors. The work presents the first systematic evaluation of LLMs on table-based question answering tasks that require error correction prior to reasoning. To this end, the authors construct an expert-annotated dataset containing realistic perturbations and conduct comprehensive experiments involving prompt engineering and multi-model comparisons. Results show that even advanced models such as GPT-5.2 suffer accuracy drops exceeding 22% under perturbation; while strategic prompting partially mitigates the degradation, it fails to fully restore unperturbed performance. These findings underscore LLMs' strong reliance on explicit instructions and motivate a novel direction, "human-like adaptive alignment", to enhance autonomous error-correction capabilities.
📝 Abstract
We investigate how large language models (LLMs) fail when tabular data in an otherwise canonical representation is subjected to semantic and structural distortions. Our findings reveal that LLMs lack an inherent ability to detect and correct subtle distortions in table representations. Only when provided with an explicit prior, via a system prompt, do models partially adjust their reasoning strategies and correct some distortions, though not consistently or completely. To study this phenomenon, we introduce a small, expert-curated dataset that explicitly evaluates LLMs on table question answering (TQA) tasks requiring an additional error-correction step prior to analysis. Our results reveal systematic differences in how LLMs ingest and interpret tabular information under distortion, with even SoTA models such as GPT-5.2 exhibiting an accuracy drop of at least 22% under distortion. These findings raise important questions for future research, particularly regarding when and how models should autonomously decide to realign tabular inputs, analogous to human behavior, without relying on explicit prompts or tabular data pre-processing.
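To make the notion of a structural distortion concrete, the following minimal sketch (a hypothetical illustration, not drawn from the paper's dataset) swaps two column headers of a small table, so that values no longer match their labels until the table is realigned:

```python
# Hypothetical example table; names and values are illustrative only.
header = ["city", "population", "area_km2"]
rows = [
    ["Berlin", 3645000, 891],
    ["Munich", 1472000, 310],
]

def swap_columns(header, i, j):
    """Return a copy of the header with positions i and j swapped,
    simulating one kind of structural distortion of a table."""
    perturbed = header.copy()
    perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed

perturbed_header = swap_columns(header, 1, 2)
# Under the perturbed header, a model answering "What is the population
# of Berlin?" by column label would read the area value (891) instead of
# the population (3645000), unless it first detects and undoes the swap.
```

A TQA system that reasons over the perturbed header without an error-correction step would return the wrong cell, which is the failure mode the evaluation is designed to expose.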