AI Summary
This paper addresses information loss and verbalization errors in natural language rendering (NLR) of table query results within Text-to-SQL systems. To this end, we propose Combo-Eval, the first dedicated framework for NLR quality assessment. Its core contributions are threefold: (1) the construction of NLR-BIRD, the first benchmark dataset specifically designed for NLR evaluation; (2) a high-fidelity automated evaluation scheme supporting both reference-based and reference-free scenarios; and (3) a hybrid evaluation mechanism integrating semantic alignment, error detection, and multi-granularity human verification. Experiments demonstrate that Combo-Eval achieves strong agreement with human judgments (Cohen's κ > 0.85), significantly outperforms existing methods across evaluation scenarios, and reduces large language model invocation overhead by 25-61%. Collectively, Combo-Eval establishes a reliable, efficient, and scalable evaluation paradigm for natural language generation from tabular data.
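The reported agreement with human judgments uses Cohen's κ, which measures how much two raters agree beyond what chance alone would produce. As a minimal illustration (the rating labels and data below are hypothetical, not from the paper), κ can be computed as:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Proportion of items where the two raters assigned the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label
    # if each labels independently according to their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical quality labels from a human judge and an automated evaluator.
human = ["good", "bad", "good", "good"]
auto_ = ["good", "bad", "bad", "good"]
print(cohens_kappa(human, auto_))  # 0.5
```

A κ above 0.8, as reported for Combo-Eval, is conventionally read as near-perfect agreement.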
Abstract
In modern industry systems such as multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. Converting tabular DB results into NL representations (NLRs) is what enables chat-based interaction, yet while NLR generation is typically handled by large language models (LLMs), information loss and errors in presenting tabular results in NL remain largely unexplored. This paper introduces a novel evaluation method, Combo-Eval, for judging LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity while reducing LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.