NLCTables: A Dataset for Marrying Natural Language Conditions with Table Discovery

📅 2025-04-22
🤖 AI Summary
Existing table discovery methods rely on a query table or a few ambiguous keywords, yield redundant results, and leave users to filter them manually. To address these limitations, the paper introduces NL-conditional table discovery (nlcTD), a task in which users combine a query table with a natural language (NL) description to retrieve tables more precisely. The authors construct nlcTables, described as the first large-scale, multi-condition benchmark dataset for this task. It comprises 627 queries across four condition categories (NL-only, union, join, fuzzy), 22,080 candidate tables, and 21,200 relevance annotations, all derived from real-world table repositories. An evaluation of six state-of-the-art table discovery methods reports an average accuracy below 0.35 on nlcTD, indicating that the task is substantially difficult. The nlcTables dataset, construction framework, and baseline implementations are publicly released to support research on semantic table search.

📝 Abstract
With the growing abundance of repositories containing tabular data, discovering relevant tables for in-depth analysis remains a challenging task. Existing table discovery methods primarily retrieve desired tables based on a query table or several vague keywords, leaving users to manually filter large result sets. To address this limitation, we propose a new task: NL-conditional table discovery (nlcTD), where users combine a query table with natural language (NL) requirements to refine search results. To advance research in this area, we present nlcTables, a comprehensive benchmark dataset comprising 627 diverse queries spanning NL-only, union, join, and fuzzy conditions, 22,080 candidate tables, and 21,200 relevance annotations. Our evaluation of six state-of-the-art table discovery methods on nlcTables reveals substantial performance gaps, highlighting the need for advanced techniques to tackle this challenging nlcTD scenario. The dataset, construction framework, and baseline implementations are publicly available at https://github.com/SuDIS-ZJU/nlcTables to foster future research.
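
To make the task concrete, the following is a minimal toy sketch of what an nlcTD query with a union condition could look like. It is not the paper's method or any of its baselines: the `Table` and `NlcQuery` types, the `keywords` field (a hypothetical stand-in for actually interpreting the NL condition), the Jaccard column overlap, and the sample tables are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Table:
    name: str
    columns: list[str]
    rows: list[dict]


@dataclass
class NlcQuery:
    query_table: Table   # the table the user starts from
    nl_condition: str    # free-text requirement, shown only for illustration
    keywords: list[str]  # hypothetical stand-in for interpreting nl_condition


def column_overlap(a: Table, b: Table) -> float:
    """Jaccard similarity of column names, a crude proxy for unionability."""
    sa, sb = set(a.columns), set(b.columns)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def satisfies(table: Table, keywords: list[str]) -> bool:
    """True if every keyword occurs in at least one cell (case-insensitive)."""
    cells = [str(v).lower() for row in table.rows for v in row.values()]
    return all(any(k.lower() in c for c in cells) for k in keywords)


def discover(query: NlcQuery, candidates: list[Table], min_overlap: float = 0.5) -> list[str]:
    """Keep candidates that meet the condition stand-in, ranked by schema overlap."""
    scored = [
        (column_overlap(query.query_table, t), t.name)
        for t in candidates
        if satisfies(t, query.keywords)
    ]
    return [name for score, name in sorted(scored, reverse=True) if score >= min_overlap]


# Made-up sample data for the sketch.
cities = Table("cities", ["city", "country", "population"],
               [{"city": "Hangzhou", "country": "China", "population": 12000000}])
candidates = [
    Table("asia_cities", ["city", "country", "population"],
          [{"city": "Osaka", "country": "Japan", "population": 2700000}]),
    Table("europe_cities", ["city", "country", "population"],
          [{"city": "Paris", "country": "France", "population": 2100000}]),
    Table("japan_rivers", ["river", "country", "length_km"],
          [{"river": "Shinano", "country": "Japan", "length_km": 367}]),
]
query = NlcQuery(cities, "unionable tables about cities in Japan", ["japan"])

result = discover(query, candidates)
print(result)  # ['asia_cities']
```

In this toy setting, `europe_cities` matches the schema but fails the condition, while `japan_rivers` meets the condition but shares too few columns with the query table. The point of the task is that both signals must hold at once, which a query table or keywords alone cannot express.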
Problem

Research questions and friction points this paper is trying to address.

Discovering relevant tables using natural language conditions
Addressing performance gaps in table discovery methods
Providing a benchmark dataset for NL-conditional table discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

NL-conditional table discovery task
Comprehensive benchmark dataset nlcTables
Evaluation of six table discovery methods
Authors

Lingxi Cui
Zhejiang University
Table Discovery, Table Augmentation, LLM4Table
Huan Li
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, China
Ke Chen
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, China
Lidan Shou
Professor of Computer Science, Zhejiang University
Database, Data & Knowledge Management, ML Systems
Gang Chen
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China