Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index

2025-04-30
Citations: 0
Influential: 0
AI Summary
To address error accumulation and insufficient semantic alignment in conventional dense retrieval pipelines for natural language-driven table discovery, this paper proposes an end-to-end differentiable search index framework that unifies indexing and retrieval within a single encoder-decoder language model. Its key contributions are: (1) prefix-aware table identifiers; (2) synthetic queries generated for each table by a large language model-based query generator; and (3) an index update mechanism based on parameter isolation, which mitigates catastrophic forgetting during continual indexing. Experiments show that the method outperforms state-of-the-art dense retrieval methods by 16.8% in accuracy on table discovery tasks and reduces forgetting by over 90% compared with other continual learning approaches.

Abstract
Natural language (NL)-driven table discovery identifies relevant tables in large table repositories based on NL queries. Current deep-learning-based methods that follow the traditional dense vector search pipeline, i.e., representation-index-search, achieve remarkable accuracy, but they face two limitations that impede further improvement: (i) errors accumulated during the table representation and indexing phases degrade subsequent search accuracy; and (ii) insufficient query-table interaction hinders effective semantic alignment. In this paper, we propose Birdie, a novel framework based on a differentiable search index. It unifies indexing and search in a single encoder-decoder language model, thereby eliminating error accumulation. Birdie first assigns each table a prefix-aware identifier and leverages a large language model-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate the continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates catastrophic forgetting. Extensive experiments demonstrate that Birdie outperforms state-of-the-art dense methods by 16.8% in accuracy and reduces forgetting by over 90% compared to other continual learning approaches.
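The search step, in which the trained model directly generates a table identifier, can be illustrated with a small sketch. The abstract does not say how generation is restricted to valid identifiers. The trie-based constrained decoding below is a technique commonly used in generative retrieval and is an assumption here, not Birdie's confirmed implementation. The integer identifier tuples, the function names, and the toy `toy_score` function standing in for the decoder's next-token scores are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

TableId = Tuple[int, ...]  # hypothetical: an identifier is a short token sequence


def build_trie(table_ids: Sequence[TableId]) -> Dict:
    """Build a nested-dict trie; tables sharing a prefix share a path."""
    root: Dict = {}
    for tid in table_ids:
        node = root
        for tok in tid:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: Dict, prefix: Sequence[int]) -> List[int]:
    """Return the tokens that extend `prefix` toward at least one valid identifier."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node)


def greedy_decode(
    trie: Dict,
    score: Callable[[TableId, int], float],
    max_len: int,
) -> TableId:
    """Greedily pick the best-scoring token among those the trie allows."""
    prefix: List[int] = []
    for _ in range(max_len):
        options = allowed_next(trie, prefix)
        if not options:
            break
        prefix.append(max(options, key=lambda t: score(tuple(prefix), t)))
    return tuple(prefix)


def toy_score(prefix: TableId, tok: int) -> float:
    """Assumed stand-in for model scores: always prefers the larger token."""
    return float(tok)


table_ids = [(0, 0), (0, 1), (1, 0)]
trie = build_trie(table_ids)
result = greedy_decode(trie, toy_score, max_len=2)
print(result)
```

In this toy run the scorer would like to emit `(1, 1)`, which is not an indexed table, so the trie forces the second token to `0`. In a real encoder-decoder model the same idea would be applied to the decoder's vocabulary at each generation step.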
Problem

Research questions and friction points this paper is trying to address.

Improves NL-driven table discovery accuracy
Reduces error accumulation in indexing and search
Enhances query-table semantic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies indexing and search via encoder-decoder model
Uses prefix-aware identifiers and synthetic queries
Implements parameter isolation for index updates
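The idea behind parameter isolation for index updates can be sketched on a toy model. This is only an illustration under stated assumptions: the source does not describe how Birdie allocates, trains, or combines isolated parameters, so the linear scorer, the per-update "unit", the averaging used in place of gradient training, and the names `train_unit` and `IsolatedIndex` are all hypothetical. The point it shows is the general one: parameters learned for earlier tables are frozen, and each update trains only a fresh set of parameters.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Unit = Dict[str, Vector]  # table identifier -> weight vector (one unit's parameters)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def train_unit(examples: List[Tuple[Vector, str]]) -> Unit:
    """Fit a fresh unit from (query vector, table identifier) pairs only.

    Averaging the query vectors per table is a stand-in for gradient training.
    """
    grouped: Dict[str, List[Vector]] = {}
    for x, tid in examples:
        grouped.setdefault(tid, []).append(x)
    return {
        tid: [sum(col) / len(xs) for col in zip(*xs)]
        for tid, xs in grouped.items()
    }


class IsolatedIndex:
    """Toy index whose parameters are split into one unit per indexing round."""

    def __init__(self, base_examples: List[Tuple[Vector, str]]) -> None:
        self.units: List[Unit] = [train_unit(base_examples)]

    def update(self, new_examples: List[Tuple[Vector, str]]) -> None:
        # Earlier units are never modified; only a new unit is trained.
        self.units.append(train_unit(new_examples))

    def search(self, x: Vector) -> str:
        # Score the query against every unit and return the best table identifier.
        scored = [(dot(w, x), tid) for unit in self.units for tid, w in unit.items()]
        return max(scored)[1]


index = IsolatedIndex([([1.0, 0.0, 0.0], "t0"), ([0.0, 1.0, 0.0], "t1")])
snapshot = {tid: list(w) for tid, w in index.units[0].items()}

index.update([([0.0, 0.0, 1.0], "t2")])  # index a newly arrived table

old_hit = index.search([1.0, 0.0, 0.0])
new_hit = index.search([0.0, 0.0, 1.0])
print(old_hit, new_hit)
```

Because the update never writes to the first unit, the parameters that answered queries about the original tables are identical before and after the new table is indexed, which is the property that limits forgetting in this family of continual learning methods.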
Similar Papers
Yuxiang Guo
Johns Hopkins University
Computer vision
Zhonghao Hu
Zhejiang University
Yuren Mao
Zhejiang University
Baihua Zheng
Singapore Management University
Yunjun Gao
Professor of Computer Science, Zhejiang University
Database, Big Data Management and Analytics, and AI Interaction with DB Technology
Mingwei Zhou
Zhejiang Dahua Technology Co., Ltd