🤖 AI Summary
This study addresses the limited robustness of large language models (LLMs) when confronted with semantically and structurally perturbed tabular data, revealing their inability to autonomously detect and correct errors. The work presents the first systematic evaluation of LLMs on table-based question answering tasks that require error correction prior to reasoning. To this end, the authors construct an expert-annotated dataset containing realistic perturbations and conduct comprehensive experiments involving prompt engineering and multi-model comparisons. Results show that even advanced models such as GPT-5.2 suffer accuracy drops exceeding 22% under perturbation; while strategic prompting partially mitigates the degradation, it fails to fully restore unperturbed performance. These findings underscore LLMs' strong reliance on explicit instructions and motivate a novel direction, "human-like adaptive alignment", to enhance autonomous error-correction capabilities.
📝 Abstract
We investigate how large language models (LLMs) fail when tabular data in an otherwise canonical representation is subjected to semantic and structural distortions. Our findings reveal that LLMs lack an inherent ability to detect and correct subtle distortions in table representations. Only when provided with an explicit prior, via a system prompt, do models partially adjust their reasoning strategies and correct some distortions, though not consistently or completely. To study this phenomenon, we introduce a small, expert-curated dataset that explicitly evaluates LLMs on table question answering (TQA) tasks requiring an additional error-correction step prior to analysis. Our results reveal systematic differences in how LLMs ingest and interpret tabular information under distortion, with even SoTA models such as GPT-5.2 exhibiting an accuracy drop of at least 22% under distortion. These findings raise important questions for future research, particularly regarding when and how models should autonomously decide to realign tabular inputs, analogous to human behavior, without relying on explicit prompts or tabular data pre-processing.
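To make the notion of a structural distortion concrete, the following minimal sketch (a hypothetical illustration, not drawn from the paper's dataset) swaps two column headers of a small table, so that values no longer match their labels until the table is realigned:

```python
# Hypothetical example table; names and values are illustrative only.
header = ["city", "population", "area_km2"]
rows = [
    ["Berlin", 3645000, 891],
    ["Munich", 1472000, 310],
]

def swap_columns(header, i, j):
    """Return a copy of the header with positions i and j swapped,
    simulating one kind of structural distortion of a table."""
    perturbed = header.copy()
    perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed

perturbed_header = swap_columns(header, 1, 2)
# Under the perturbed header, a model answering "What is the population
# of Berlin?" by column label would read the area value (891) instead of
# the population (3645000), unless it first detects and undoes the swap.
```

A TQA system that reasons over the perturbed header without an error-correction step would return the wrong cell, which is the failure mode the evaluation is designed to expose.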