NLCTables: A Dataset for Marrying Natural Language Conditions with Table Discovery

📅 2025-04-22

🤖 AI Summary

Existing table discovery methods rely on a query table or ambiguous keywords, yield redundant results, and leave users to filter them manually. To address these limitations, this paper introduces NL-conditional table discovery (nlcTD), a task in which users jointly specify a query table and a natural language description to enable precise, semantics-aware table retrieval. The authors construct nlcTables, described as the first large-scale, multi-condition benchmark dataset, comprising 627 queries across four categories (NL-only, union, join, fuzzy), 22,080 candidate tables, and 21,200 relevance annotations, all derived from real-world table repositories. A comprehensive evaluation of six state-of-the-art methods reveals an average accuracy below 0.35 on nlcTD, confirming its substantial difficulty. The nlcTables dataset, construction framework, and baseline implementations are publicly released to establish a foundation for semantic table search research.

📝 Abstract

With the growing abundance of repositories containing tabular data, discovering relevant tables for in-depth analysis remains a challenging task. Existing table discovery methods primarily retrieve desired tables based on a query table or several vague keywords, leaving users to manually filter large result sets. To address this limitation, we propose a new task: NL-conditional table discovery (nlcTD), where users combine a query table with natural language (NL) requirements to refine search results. To advance research in this area, we present nlcTables, a comprehensive benchmark dataset comprising 627 diverse queries spanning NL-only, union, join, and fuzzy conditions, 22,080 candidate tables, and 21,200 relevance annotations. Our evaluation of six state-of-the-art table discovery methods on nlcTables reveals substantial performance gaps, highlighting the need for advanced techniques to tackle this challenging nlcTD scenario. The dataset, construction framework, and baseline implementations are publicly available at https://github.com/SuDIS-ZJU/nlcTables to foster future research.
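To make the task setup concrete, here is a minimal sketch of what an nlcTD query might look like: an optional query table plus a natural language condition, matched against a pool of candidate tables. Everything in it is an illustrative assumption and not the paper's method or the nlcTables data format: the `NlcQuery` class, the column-name Jaccard score used as a crude stand-in for unionability, the `min_overlap` threshold, and the keyword-based `mentions_column` check that stands in for real natural language understanding.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical toy representation: column name -> list of cell values.
Table = Dict[str, list]


@dataclass
class NlcQuery:
    """An illustrative nlcTD query: NL condition plus an optional query table."""
    nl_condition: str
    query_table: Optional[Table] = None  # None models an NL-only query


def column_overlap(a: Table, b: Table) -> float:
    """Jaccard overlap of column names (a crude stand-in for unionability)."""
    cols_a, cols_b = set(a), set(b)
    union = cols_a | cols_b
    return len(cols_a & cols_b) / len(union) if union else 0.0


def mentions_column(condition: str, table: Table) -> bool:
    """Toy condition check: does the NL text mention any column name?"""
    text = condition.lower()
    return any(col.lower() in text for col in table)


def discover(
    query: NlcQuery,
    candidates: Dict[str, Table],
    satisfies: Callable[[str, Table], bool] = mentions_column,
    min_overlap: float = 0.5,
) -> List[str]:
    """Return names of candidates that match the table and the NL condition."""
    hits = []
    for name, table in candidates.items():
        # Table-side filter, skipped for NL-only queries.
        if query.query_table is not None:
            if column_overlap(query.query_table, table) < min_overlap:
                continue
        # NL-side filter.
        if satisfies(query.nl_condition, table):
            hits.append(name)
    return hits


query_table = {"city": ["Paris"], "population": [2_100_000]}
candidates = {
    "t1": {"city": ["Rome"], "population": [2_800_000], "country": ["Italy"]},
    "t2": {"city": ["Oslo"], "population": [700_000]},
    "t3": {"country": ["Japan"], "gdp": [4.2]},
}
condition = "keep only tables that include a country column"

print(discover(NlcQuery(condition, query_table), candidates))  # ['t1']
print(discover(NlcQuery(condition), candidates))               # ['t1', 't3']
```

With the query table supplied, `t3` is dropped because it shares no columns with it, and `t2` is dropped because it does not satisfy the condition. Without the query table, only the NL condition filters the candidates. A real system would replace both filters with learned or LLM-based components.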

Problem

Research questions and friction points this paper is trying to address.

Discovering relevant tables using natural language conditions

Addressing performance gaps in table discovery methods

Providing a benchmark dataset for NL-conditional table discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

NL-conditional table discovery task

Comprehensive benchmark dataset nlcTables

Evaluation of six table discovery methods
