RADAR: Benchmarking Language Models on Imperfect Tabular Data

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current language models exhibit weak “data awareness” when processing imperfect tabular data—such as missing values, outliers, and logical inconsistencies—severely compromising analytical reliability. To address this gap, we introduce RADAR, the first dedicated benchmark for evaluating model robustness to real-world tabular data imperfections. Our method comprises: (1) a novel programmatic data perturbation framework covering five realistic defect types across nine domains; (2) the first incorporation of controllable table-scale variables to systematically expose model robustness bottlenecks under progressive data quality degradation; and (3) a multi-dimensional structured query set paired with a cross-scale evaluation protocol. Experiments reveal that state-of-the-art models perform well on clean tables but suffer substantial performance drops under perturbations—confirming their inadequate data perception capabilities. RADAR is publicly released, supporting extensible perturbation generation and fine-grained scale control.

📝 Abstract
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework that simulates data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as tables grow. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LMs' ability to handle imperfect tabular data artifacts
Assessing model robustness against missing values and inconsistencies
Benchmarking performance degradation with increasing table sizes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LMs on imperfect tabular data
Simulating data artifacts via programmatic perturbations
Evaluating artifact handling across diverse domains
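The "programmatic perturbations" idea above can be illustrated with a minimal sketch. This is a toy example, not RADAR's released code: the function names (`inject_missing`, `inject_outlier`) and the tiny price table are hypothetical, and show only two of the paper's five artifact types.

```python
import random

def inject_missing(rows, column, frac, seed=0):
    """Blank out a fraction of values in `column` to simulate missingness."""
    rng = random.Random(seed)
    rows = [dict(r) for r in rows]  # copy so the clean table survives
    hit = rng.sample(range(len(rows)), k=max(1, int(frac * len(rows))))
    for i in hit:
        rows[i][column] = None
    return rows

def inject_outlier(rows, column, factor=100, seed=0):
    """Scale one non-missing value far outside the column's normal range."""
    rng = random.Random(seed)
    rows = [dict(r) for r in rows]
    candidates = [i for i, r in enumerate(rows) if r[column] is not None]
    i = rng.choice(candidates)
    rows[i][column] = rows[i][column] * factor
    return rows

# A clean toy table, then a perturbed copy with 20% missing prices
# plus one injected outlier.
clean = [{"id": i, "price": 10.0 + i} for i in range(10)]
perturbed = inject_outlier(inject_missing(clean, "price", frac=0.2), "price")
```

Because perturbations are applied to copies of a clean table, the same query can be evaluated against both versions, so any performance gap is attributable to the artifact itself.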
Ken Gu
Paul G. Allen School of Computer Science & Engineering, University of Washington
Data Science, Natural Language Processing, Human-Computer Interaction
Zhihan Zhang
PhD student, University of Notre Dame
Natural Language Processing
Kate Lin
Google Research
Yuwei Zhang
Google Research
Akshay Paruchuri
PhD Student, University of North Carolina at Chapel Hill
Computer Vision, Machine Learning, Natural Language Processing, Multimodal AI, Healthcare
Hong Yu
Google Research
Mehran Kazemi
Staff Research Scientist, Google DeepMind
Machine Learning, Large Language Models, Reasoning, Artificial General Intelligence
Kumar Ayush
Google | Stanford University | Indian Institute of Technology Kharagpur
Foundation Models, Large Language Models, Generative AI, RLHF
A. Heydari
Google Research
Maxwell A. Xu
Google Research
Girish Narayanswamy
UbiComp Lab, University of Washington
Health Sensing, Signal Processing, Machine Learning, Artificial Intelligence, Embedded Systems
Yun Liu
Google Research
Ming-Zher Poh
Google, MIT
machine learning, physiological sensing, wearable sensors, mobile health, computational physiology
Yuzhe Yang
Google Research
Mark Malhotra
Google Research
Shwetak Patel
University of Washington, Washington Research Foundation Endowed Professor, Computer Science
Ubiquitous Computing, Human-Computer Interaction, Sensors, Embedded Systems
Hamid Palangi
Google and University of Washington
Artificial Intelligence, Machine Learning, Natural Language Processing
Xuhai Xu
Assistant Professor, Columbia University | Google
Human-Computer Interaction, Ubiquitous Computing, Human-Centered AI, mHealth, Health Informatics
D. McDuff
Google Research
Tim Althoff
Associate Professor of Computer Science, University of Washington
Human AI Interaction, Natural Language Processing, Behavioral Data Science, AI for Mental Health
Xin Liu
Google Research