🤖 AI Summary
This work addresses cross-table retrieval and information composition in open-domain question answering. We propose an iterative multi-table retrieval framework that jointly optimizes semantic relevance, query coverage, and structural connectability. To our knowledge, this is the first approach to formulate multi-table retrieval as a greedy iterative search process; its lightweight join-aware algorithm evaluates semantic matching, coverage completeness, and inter-table joinability at each step. Evaluated on five NL2SQL benchmarks, the method achieves retrieval performance competitive with an exact Mixed-Integer Programming (MIP) approach while running 4–400× faster, depending on the benchmark and search-space settings, and it significantly outperforms conventional single-objective heuristics. Our core contributions are: (i) the first iterative retrieval paradigm that jointly optimizes semantic relevance, coverage, and structural connectability; and (ii) an efficient, interpretable, and scalable solution for multi-table joint retrieval.
📝 Abstract
Open-domain question answering over data lakes requires retrieving and composing information from multiple tables, a challenging subtask that demands both semantic relevance and structural coherence (e.g., joinability). While exact optimization methods such as Mixed-Integer Programming (MIP) can ensure coherence, their computational cost is often prohibitive. Conversely, simpler greedy heuristics that optimize for query coverage alone often fail to find coherent, joinable sets of tables. This paper frames multi-table retrieval as an iterative search process, arguing that this approach offers advantages in scalability, interpretability, and flexibility. We propose a general framework and a concrete instantiation: a fast, effective Greedy Join-Aware Retrieval algorithm that holistically balances relevance, coverage, and joinability. Experiments across five NL2SQL benchmarks demonstrate that our iterative method achieves retrieval performance competitive with the MIP-based approach while being 4–400× faster, depending on the benchmark and search-space settings. This work highlights the potential of iterative heuristics for practical, scalable, and composition-aware retrieval.
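
To make the idea of greedy, join-aware selection concrete, the following is a minimal sketch and not the paper's actual algorithm. The abstract does not specify the scoring function, so everything here is an assumption for illustration: the function name `greedy_join_aware_retrieval`, the term-overlap proxies for relevance and coverage, the binary joinability bonus, the equal weights, and the toy tables are all hypothetical.

```python
def greedy_join_aware_retrieval(query_terms, tables, joinable, k,
                                w_rel=1.0, w_cov=1.0, w_join=1.0):
    """Pick up to k tables, one per iteration, by a combined score.

    Assumed (hypothetical) inputs:
      query_terms: iterable of terms extracted from the question
      tables:      dict mapping table name -> set of terms it can answer
      joinable:    set of frozenset({name_a, name_b}) pairs that can be joined
    """
    query_terms = set(query_terms)
    selected, covered = [], set()
    # Only tables that match at least one query term are candidates.
    candidates = [name for name, terms in tables.items() if terms & query_terms]

    def score(name):
        matched = tables[name] & query_terms
        # Relevance: fraction of query terms this table matches.
        relevance = len(matched) / len(query_terms)
        # Coverage: fraction of query terms it adds beyond those already covered.
        coverage = len(matched - covered) / len(query_terms)
        # Joinability: 1.0 if it joins with any already-selected table, else 0.0.
        join = float(any(frozenset((name, s)) in joinable for s in selected))
        return w_rel * relevance + w_cov * coverage + w_join * join

    while candidates and len(selected) < k:
        best = max(candidates, key=score)  # greedy step: best candidate right now
        selected.append(best)
        covered |= tables[best] & query_terms
        candidates.remove(best)
    return selected


# Toy data lake (entirely made up for this example).
tables = {
    "customers": {"customer", "city"},
    "orders": {"order", "total", "customer"},
    "cities_wiki": {"city", "population"},
    "products": {"product", "price"},
}
joinable = {
    frozenset(("customers", "orders")),
    frozenset(("orders", "products")),
}
query = {"customer", "order", "total", "city"}

print(greedy_join_aware_retrieval(query, tables, joinable, k=2))
# → ['orders', 'customers']
```

In this toy setting, after `orders` is chosen, both `customers` and `cities_wiki` would add the same amount of new coverage (the term `city`), so a coverage-only heuristic could not tell them apart. The joinability term breaks the tie in favor of `customers`, which can actually be joined with `orders`. This mirrors the failure mode of coverage-only heuristics described in the abstract.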