AI Summary
This paper addresses information loss and verbalization errors in natural language rendering (NLR) of table query results within Text-to-SQL systems. To this end, we propose Combo-Eval, the first dedicated framework for NLR quality assessment. Its core contributions are threefold: (1) the construction of NLR-BIRD, the first benchmark dataset specifically designed for NLR evaluation; (2) a high-fidelity automated evaluation scheme supporting both reference-based and reference-free scenarios; and (3) a hybrid evaluation mechanism integrating semantic alignment, error detection, and multi-granularity human verification. Experiments demonstrate that Combo-Eval achieves strong agreement with human judgments (Cohen's κ > 0.85), significantly outperforms existing methods across evaluation scenarios, and reduces large language model invocation overhead by 25-61%. Collectively, Combo-Eval establishes a reliable, efficient, and scalable evaluation paradigm for natural language generation from tabular data.
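The reported agreement with human judgments uses Cohen's κ, which measures how much two raters agree beyond what chance alone would produce. As a minimal illustration (the rating labels and data below are hypothetical, not from the paper), κ can be computed as:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Proportion of items where the two raters assigned the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label
    # if each labels independently according to their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical quality labels from a human judge and an automated evaluator.
human = ["good", "bad", "good", "good"]
auto_ = ["good", "bad", "bad", "good"]
print(cohens_kappa(human, auto_))  # 0.5
```

A κ above 0.8, as reported for Combo-Eval, is conventionally read as near-perfect agreement.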
Abstract
In modern industry systems such as multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. Converting tabular DB results into NL representations (NLRs) is what enables chat-based interaction, yet while NLR generation is typically handled by large language models (LLMs), information loss and errors in presenting tabular results in NL remain largely unexplored. This paper introduces a novel evaluation method, Combo-Eval, for judging LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity while reducing LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.