FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation

📅 2025-10-15
🤖 AI Summary
Evaluation of table-to-text generation currently faces three challenges: contamination of LLM training data, domain imbalance, and scarcity of non-English resources. To address these, the authors propose FreshTab, an on-demand, multilingual, domain-balanced benchmark built on Wikipedia. FreshTab extracts recent tables from multilingual Wikipedia pages at collection time, so the test data postdates LLM training cutoffs and cannot have leaked into training sets. Test sets are collected on demand in English, German, French, and Russian. Experiments show that insights LLMs generate from these fresh tables score clearly worse under automatic metrics, yet this gap does not carry over to LLM-based or human evaluation. Domain effects appear in all evaluations, indicating that a domain-balanced benchmark is more challenging and supporting fairer, contamination-free, cross-lingual table-to-text evaluation.

📝 Abstract
Table-to-text generation (insight generation from tables) is a challenging task that requires precision in analyzing the data. In addition, the evaluation of existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as domain imbalance. We introduce FreshTab, an on-the-fly table-to-text benchmark generation from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method appear clearly worse by automatic metrics, but this does not translate into LLM and human evaluations. Domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging.
Problem

Research questions and friction points this paper is trying to address.

Addressing data contamination in table-to-text evaluation benchmarks
Solving domain imbalance issues in table-to-text generation assessment
Providing multilingual table-to-text datasets for comprehensive evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic benchmark generation from Wikipedia
Multilingual dataset collection on demand
Domain-balanced data collection enabling domain-sensitive evaluation
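The core idea of on-the-fly benchmark generation is to pull tables from recently created Wikipedia pages so they cannot appear in any LLM's training data. A minimal sketch of the extraction step is below, assuming pages are already fetched as HTML; it uses only the standard library to pull cell text out of `wikitable`-classed tables (the class name Wikipedia uses for data tables). The `WikiTableParser` and `extract_tables` names are illustrative, not the paper's implementation.

```python
from html.parser import HTMLParser


class WikiTableParser(HTMLParser):
    """Collect cell text from <table class="wikitable"> elements."""

    def __init__(self):
        super().__init__()
        self.in_table = False   # inside a wikitable?
        self.in_cell = False    # inside a <td>/<th>?
        self.rows = []          # completed rows of cell strings
        self.current_row = []
        self.current_cell = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and "wikitable" in attrs.get("class", ""):
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.current_row = []
        elif self.in_table and tag in ("td", "th"):
            self.in_cell = True
            self.current_cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif self.in_table and tag == "tr" and self.current_row:
            self.rows.append(self.current_row)
        elif tag in ("td", "th") and self.in_cell:
            self.in_cell = False
            self.current_row.append("".join(self.current_cell).strip())

    def handle_data(self, data):
        if self.in_cell:
            self.current_cell.append(data)


def extract_tables(page_html: str) -> list[list[str]]:
    """Return all wikitable rows found in a Wikipedia page's HTML."""
    parser = WikiTableParser()
    parser.feed(page_html)
    return parser.rows


# Usage with a small embedded snippet standing in for a fetched page:
sample = """<table class="wikitable">
<tr><th>Event</th><th>Date</th></tr>
<tr><td>Election</td><td>2025-09-29</td></tr>
</table>"""
print(extract_tables(sample))
```

In a real pipeline, `page_html` would come from the MediaWiki API filtered by page-creation date, which is what guarantees the freshness and contamination-safety the paper relies on.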