🤖 AI Summary
Table quality assessment has long been hindered by the inability of conventional metrics to capture fine-grained structural and semantic discrepancies. To address this, we propose TabXEval, a two-stage evaluation framework: TabAlign performs structural alignment, and TabCompare then carries out a joint syntactic and semantic comparison. Our contributions are threefold: (1) the first evaluation rubric integrating multi-level structural descriptions with contextual quantification; (2) TabXBench, the first cross-domain benchmark featuring realistic perturbations and expert-annotated ground truth; and (3) fully attributable and traceable evaluation outcomes. On TabXBench, our method significantly outperforms established baselines such as BLEU, ROUGE, and BERTScore, achieving a 42.3% improvement in detecting subtle structural and content errors. Domain experts strongly endorse its interpretability, with high inter-annotator agreement (Cohen's κ = 0.89).
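For context on the reported agreement figure: Cohen's κ quantifies inter-annotator agreement corrected for chance. With observed agreement $p_o$ and expected chance agreement $p_e$, it is defined as

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

On the standard Landis-Koch scale, the reported κ = 0.89 falls in the "almost perfect" band (0.81 to 1.00).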
📝 Abstract
Evaluating tables qualitatively and quantitatively presents a significant challenge, as traditional metrics often fail to capture nuanced structural and content discrepancies. To address this, we introduce a methodical rubric integrating multi-level structural descriptors with fine-grained contextual quantification, establishing a robust foundation for comprehensive table comparison. Building on this foundation, we propose TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval first structurally aligns candidate tables with their references via TabAlign, then conducts a systematic semantic and syntactic comparison using TabCompare; this approach makes the evaluation process transparent and pinpoints subtle discrepancies overlooked by conventional methods. We assess the framework on TabXBench, a diverse multi-domain benchmark we developed that features realistic table perturbations and human-annotated assessments. Finally, a systematic analysis of existing evaluation methods through sensitivity-specificity trade-offs demonstrates the qualitative and quantitative effectiveness of TabXEval across diverse table-related tasks and domains, paving the way for future innovations in explainable table evaluation.
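To make the two-phase design concrete, below is a minimal, illustrative sketch of an align-then-compare pipeline in Python. All function names, the character-level similarity measure, the greedy row matching, and the 0.8 flagging threshold are our own assumptions for illustration; the abstract does not specify the actual TabAlign and TabCompare procedures.

```python
# Hypothetical sketch of a two-phase table evaluation pipeline in the spirit
# of TabXEval: phase 1 aligns rows between reference and prediction
# (TabAlign-style), phase 2 scores aligned cell pairs (TabCompare-style).
# Names, similarity measure, and threshold are illustrative assumptions.

from difflib import SequenceMatcher

Table = list[list[str]]  # a table as rows of cell strings


def cell_similarity(a: str, b: str) -> float:
    """Character-level similarity as a stand-in for a semantic scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_rows(reference: Table, prediction: Table) -> list[tuple[int, int]]:
    """Phase 1: greedily pair each reference row with its best-matching,
    not-yet-used prediction row by average cell similarity."""
    pairs: list[tuple[int, int]] = []
    used: set[int] = set()
    for i, ref_row in enumerate(reference):
        best_j, best_score = None, 0.0
        for j, pred_row in enumerate(prediction):
            if j in used:
                continue
            # zip truncates to the shorter row; fine for a sketch
            score = sum(cell_similarity(r, p) for r, p in zip(ref_row, pred_row))
            score /= max(len(ref_row), 1)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            pairs.append((i, best_j))
            used.add(best_j)
    return pairs


def compare(reference: Table, prediction: Table) -> dict:
    """Phase 2: score aligned cells and record which cells disagree, so the
    final score is attributable to specific row/column positions."""
    pairs = align_rows(reference, prediction)
    scores, mismatches = [], []
    for i, j in pairs:
        for k, (r, p) in enumerate(zip(reference[i], prediction[j])):
            s = cell_similarity(r, p)
            scores.append(s)
            if s < 0.8:  # arbitrary threshold for flagging a discrepancy
                mismatches.append(
                    {"ref_row": i, "pred_row": j, "col": k, "ref": r, "pred": p}
                )
    overall = sum(scores) / len(scores) if scores else 0.0
    return {
        "score": overall,
        "mismatches": mismatches,
        "unmatched_ref_rows": len(reference) - len(pairs),
    }


if __name__ == "__main__":
    ref = [["city", "population"], ["Paris", "2.1M"], ["Rome", "2.8M"]]
    pred = [["city", "population"], ["Rome", "2.8M"], ["Paris", "2.0M"]]
    print(compare(ref, pred))  # flags the Paris population cell, not a flat score
```

The property this sketch mirrors is attributability: every flagged discrepancy points to a specific cell, rather than collapsing the comparison into a single opaque number the way n-gram or embedding metrics do.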