TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Table quality assessment has long been hindered by the inability of conventional metrics to capture fine-grained structural and semantic discrepancies. To address this, we propose a two-stage evaluation framework: TabAlign for structural alignment, followed by TabCompare for joint syntactic and semantic comparison. Our contributions are threefold: (1) the first evaluation rubric integrating multi-level structural descriptions with contextual quantification; (2) TabXBench—the first cross-domain benchmark featuring realistic perturbations and expert-annotated ground truth; and (3) fully attributable and traceable evaluation outcomes. On TabXBench, our method significantly outperforms established baselines—including BLEU, ROUGE, and BERTScore—achieving a 42.3% improvement in detecting subtle structural and content errors. Domain experts strongly endorse its interpretability, yielding a high inter-annotator agreement (Cohen’s κ = 0.89).
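The failure mode the summary attributes to conventional metrics is easy to reproduce: string-overlap scores operate on a flattened token sequence, so a structural corruption that preserves the multiset of tokens goes undetected. The sketch below is illustrative only; it uses a simple unigram F1 as a stand-in for BLEU/ROUGE-style overlap metrics, and `serialize` is a hypothetical flattening step, not the paper's.

```python
from collections import Counter

def serialize(table):
    """Flatten a table (list of rows) into a token sequence, as string metrics do."""
    return [str(cell) for row in table for cell in row]

def unigram_f1(ref, hyp):
    """Bag-of-tokens F1 — a stand-in for overlap metrics like BLEU/ROUGE."""
    ref_c, hyp_c = Counter(ref), Counter(hyp)
    overlap = sum((ref_c & hyp_c).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = [["Name", "Age"], ["Ana", 34], ["Bo", 29]]
# Structurally corrupted: the two columns are swapped, changing every record.
swapped = [["Age", "Name"], [34, "Ana"], [29, "Bo"]]

print(unigram_f1(serialize(reference), serialize(swapped)))  # 1.0 — the corruption is invisible
```

Both tables flatten to the same bag of tokens, so the overlap score is perfect even though every record is now wrong, which is exactly the gap a structure-aware rubric targets.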

📝 Abstract
Evaluating tables qualitatively and quantitatively presents a significant challenge, as traditional metrics often fail to capture nuanced structural and content discrepancies. To address this, we introduce a novel, methodical rubric integrating multi-level structural descriptors with fine-grained contextual quantification, thereby establishing a robust foundation for comprehensive table comparison. Building on this foundation, we propose TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval first aligns reference tables structurally via TabAlign, then conducts a systematic semantic and syntactic comparison using TabCompare; this approach clarifies the evaluation process and pinpoints subtle discrepancies overlooked by conventional methods. The efficacy of this framework is assessed using TabXBench, a novel, diverse, multi-domain benchmark we developed, featuring realistic table perturbations and human-annotated assessments. Finally, a systematic analysis of existing evaluation methods through sensitivity-specificity trade-offs demonstrates the qualitative and quantitative effectiveness of TabXEval across diverse table-related tasks and domains, paving the way for future innovations in explainable table evaluation.
Problem

Research questions and friction points this paper is trying to address.

Developing a comprehensive rubric for qualitative and quantitative table evaluation
Creating an explainable framework to compare tables structurally and semantically
Assessing evaluation methods using a diverse benchmark with realistic perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level structural descriptors with contextual quantification
Two-phase evaluation framework: TabAlign and TabCompare
TabXBench benchmark for diverse multi-domain assessment
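The two-phase design above — structural alignment first, then cell-level comparison — can be sketched in miniature. This is a hypothetical simplification for intuition, not the paper's actual algorithm: the header-similarity matcher, the 0.5 threshold, and the cell-equality check are all assumptions standing in for TabAlign's and TabCompare's far richer rubric.

```python
from difflib import SequenceMatcher

def _sim(a, b):
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def align_columns(ref_header, hyp_header):
    """Phase 1 (TabAlign-like): greedily match hypothesis columns to reference
    columns by header similarity; unmatched columns count as structural errors."""
    mapping, used = {}, set()
    for i, rh in enumerate(ref_header):
        best, best_s = None, 0.5  # similarity threshold — an assumed hyperparameter
        for j, hh in enumerate(hyp_header):
            if j not in used and _sim(rh, hh) > best_s:
                best, best_s = j, _sim(rh, hh)
        if best is not None:
            mapping[i] = best
            used.add(best)
    return mapping

def compare_cells(ref, hyp, mapping):
    """Phase 2 (TabCompare-like): compare aligned cells row by row, so every
    reported discrepancy is attributable to a specific cell."""
    errors = []
    for r, ref_row in enumerate(ref[1:], start=1):
        for i, j in mapping.items():
            if r < len(hyp) and str(ref_row[i]) != str(hyp[r][j]):
                errors.append((r, ref[0][i], ref_row[i], hyp[r][j]))
    return errors

ref = [["Name", "Age"], ["Ana", 34], ["Bo", 29]]
hyp = [["Age", "Name"], [34, "Ana"], [29, "Bob"]]  # columns swapped + one wrong value
mapping = align_columns(ref[0], hyp[0])
errors = compare_cells(ref, hyp, mapping)
print(mapping)  # {0: 1, 1: 0}
print(errors)   # [(2, 'Name', 'Bo', 'Bob')]
```

Because alignment is resolved before comparison, the column swap is absorbed by the mapping and only the genuine content error surfaces — the property that makes the evaluation outcome traceable to individual cells.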