🤖 AI Summary
This work addresses the limitations of traditional table retrieval methods when metadata is missing or of poor quality. The authors propose PIPER, a content-based retrieval framework that first builds a profile (summary) of each table and then uses large language models (LLMs) to generate pseudo-queries, which are embedded for dense vector retrieval to rank candidate datasets. The approach is presented as the first to jointly use LLM-generated pseudo-queries and table summaries for tabular dataset retrieval, departing from paradigms that rely heavily on metadata or are tailored to table question answering (TableQA). In experiments under low-quality metadata conditions, the method outperforms both metadata-based baselines and strong TableQA-oriented retrieval approaches, supporting the value of content-driven modeling for table retrieval.
📝 Abstract
The rapid growth of tabular datasets in data lakes, data spaces, and open data portals makes effective dataset search essential for reuse and analysis. Existing search systems rely mainly on metadata, which is often incomplete or low quality, especially for tables whose meaning depends on both schema and cell values. Recent advances in Large Language Models (LLMs) enable richer, content-based representations of tables. However, prior LLM-based retrieval methods have focused on Table Question Answering, where the goal is to select a single table to answer a question, rather than retrieve and rank relevant datasets. We propose PIPER, a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated queries embedded for dense retrieval. Designed for dataset search in poor-metadata settings, PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.
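To make the pipeline described above concrete, here is a minimal sketch of profile-plus-pseudo-query retrieval. It is an illustration under stated assumptions, not PIPER's implementation: every function name is hypothetical, `generate_queries` is a template-based placeholder for the LLM call, `embed` uses a normalised bag-of-words vector in place of a learned dense encoder, and scoring a table by its best-matching vector is a design choice the abstract does not specify.

```python
import math
import re
from collections import Counter


def profile_table(name, columns, rows, max_values=3):
    """Build a short text profile from the schema and a few sample cell values."""
    parts = [f"table {name}"]
    for i, col in enumerate(columns):
        samples = ", ".join(str(row[i]) for row in rows[:max_values])
        parts.append(f"column {col}: {samples}")
    return ". ".join(parts)


def generate_queries(columns):
    """Hypothetical stand-in for the LLM step: template-based pseudo-queries."""
    return [f"dataset with {col} information" for col in columns]


def embed(text):
    """Toy encoder (assumption): L2-normalised bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    """Dot product of two normalised sparse vectors."""
    return sum(w * b[tok] for tok, w in a.items() if tok in b)


def build_index(tables):
    """Embed each table's profile together with its pseudo-queries."""
    index = {}
    for name, (columns, rows) in tables.items():
        texts = [profile_table(name, columns, rows)] + generate_queries(columns)
        index[name] = [embed(t) for t in texts]
    return index


def search(index, query, k=5):
    """Rank tables by their best-matching vector (max-pooling is an assumption)."""
    q = embed(query)
    scored = [(max(cosine(q, v) for v in vecs), name) for name, vecs in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Two made-up tables, identified only by their content (no metadata).
tables = {
    "cities": (["city", "population"], [("Paris", 2100000), ("Berlin", 3600000)]),
    "football": (["player", "goals"], [("Messi", 30), ("Kane", 25)]),
}
index = build_index(tables)
print(search(index, "population of Paris and Berlin", k=2))
```

In a realistic setting, `generate_queries` would prompt an LLM with the table profile and `embed` would call a sentence encoder, with the vectors stored in an approximate nearest-neighbour index. The overall shape (profile, generate, embed, rank) stays the same.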