🤖 AI Summary
Existing benchmarks focus narrowly on single-table factoid question answering and lack evaluation frameworks for data product discovery in realistic analytical scenarios. Method: We introduce DPBench—the first user-request-driven benchmark for data product discovery—supporting joint retrieval across heterogeneous assets (tables and unstructured text). We propose a novel "data product–level" evaluation paradigm and develop a table-text co-discovery method grounded in semantic clustering and multi-LLM consensus verification, ensuring full provenance and executable request fulfillment. Contribution/Results: DPBench comprises 1.2K+ expert-crafted Data Product Requests (DPRs) covering multi-source, multi-step, and auditable requirements. Empirical analysis demonstrates the feasibility of hybrid retrieval (dense + sparse) for this task, exposes its key bottlenecks, and establishes the first standardized evaluation foundation for automated data product engineering.
📝 Abstract
Data products are reusable, self-contained assets designed for specific business use cases. Automating their discovery and generation is of great industry interest, as it enables asset discovery in large data lakes and supports analytical Data Product Requests (DPRs). No benchmark currently exists specifically for data product discovery: existing datasets focus on answering single factoid questions over individual tables rather than on collecting multiple data assets into broader, coherent products. To address this gap, we introduce DPBench, the first user-request-driven data product benchmark over hybrid table-text corpora. Our framework systematically repurposes existing table-text QA datasets by clustering related tables and passages into coherent data products, generating professional-level analytical requests that span both data sources, and validating benchmark quality through multi-LLM evaluation. DPBench preserves full provenance while producing actionable, analyst-like data product requests. Baseline experiments with hybrid retrieval methods establish the feasibility of DPR evaluation, reveal current limitations, and point to new opportunities for automatic data product discovery research. Code and datasets are available at: https://anonymous.4open.science/r/data-product-benchmark-BBA7/
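The abstract's hybrid retrieval baseline (dense + sparse) can be illustrated with a minimal score-fusion sketch. This is an assumption-laden toy, not the paper's implementation: the `sparse_score` (lexical overlap standing in for BM25), `dense_score` (cosine similarity standing in for an embedding retriever), and the weighted-sum fusion with parameter `alpha` are all hypothetical simplifications.

```python
import math

def sparse_score(query_tokens, doc_tokens):
    """Toy lexical-overlap score (stand-in for a BM25-style sparse retriever)."""
    q, d = set(query_tokens), set(doc_tokens)
    return len(q & d) / math.sqrt(len(d)) if d else 0.0

def dense_score(q_vec, d_vec):
    """Cosine similarity between embeddings (stand-in for a dense retriever)."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    nq = math.sqrt(sum(a * a for a in q_vec))
    nd = math.sqrt(sum(b * b for b in d_vec))
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_rank(query, assets, alpha=0.5):
    """Rank candidate assets (tables or text passages) for a data product
    request by a weighted fusion of sparse and dense scores."""
    scored = []
    for asset in assets:
        s = sparse_score(query["tokens"], asset["tokens"])
        d = dense_score(query["vec"], asset["vec"])
        scored.append((alpha * d + (1 - alpha) * s, asset["id"]))
    return sorted(scored, reverse=True)
```

A request is represented here as tokenized text plus an embedding; in practice the fusion weight `alpha` (and the choice of score normalization) is exactly the kind of knob whose limitations a benchmark like DPBench would expose.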