🤖 AI Summary
Existing methods for evaluating LLM-based table generation either neglect structural constraints or rely on fixed reference tables, limiting their generalizability. This paper introduces TabReX, the first reference-free, graph-structure-driven evaluation framework for table generation. TabReX represents both textual sources and tabular outputs as knowledge graphs and employs LLM-guided alignment to yield interpretable, quantitative scores measuring both structural fidelity and factual accuracy. Key contributions include: (1) attribute-level graph reasoning; (2) customizable rubrics; (3) controllable sensitivity and specificity; (4) cell-level error tracing; and (5) fine-grained model-vs-prompt analysis. Evaluated on TabReX-Bench, a new multidimensional perturbation benchmark covering six domains and twelve perturbation types, TabReX significantly outperforms existing metrics: it achieves the highest correlation with expert rankings and remains robust under strong perturbations. TabReX establishes a new, interpretable paradigm for trustworthy table generation evaluation.
📝 Abstract
Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
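To make the pipeline concrete, here is a minimal sketch (not the authors' implementation; the function names, the triple representation, and the F-beta scoring rule are all illustrative assumptions) of the core idea: flatten a table into knowledge-graph-style (entity, attribute, value) triples, then score a generated table against source-derived triples with a beta parameter that trades sensitivity against specificity. The actual TabReX metric uses LLM-guided alignment and rubric-aware scoring rather than exact triple matching.

```python
# Illustrative sketch only: exact triple overlap stands in for TabReX's
# LLM-guided graph alignment and rubric-aware scoring.

def table_to_triples(table, key_col):
    """Flatten a table (list of row dicts) into (key, attribute, value) triples."""
    triples = set()
    for row in table:
        key = row[key_col]
        for attr, val in row.items():
            if attr != key_col:
                triples.add((key, attr, str(val)))
    return triples

def fidelity(source_triples, generated_triples, beta=1.0):
    """F-beta over triple overlap; beta > 1 favors recall (sensitivity),
    beta < 1 favors precision (specificity)."""
    tp = len(source_triples & generated_triples)
    if tp == 0:
        return 0.0
    precision = tp / len(generated_triples)
    recall = tp / len(source_triples)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy example: one hallucinated cell (pop_m) out of two attributes.
source = [{"city": "Paris", "country": "France", "pop_m": "2.1"}]
generated = [{"city": "Paris", "country": "France", "pop_m": "2.2"}]
s = table_to_triples(source, "city")
g = table_to_triples(generated, "city")
print(fidelity(s, g))  # → 0.5

# Cell-level error tracing falls out of the set difference:
print(sorted(g - s))  # → [('Paris', 'pop_m', '2.2')]
```

In this toy setup, the mismatched triples directly identify which cells diverge, mirroring (in a much cruder form) the cell-level error traces the framework reports.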