Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

📅 2024-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low query allocation efficiency, high computational overhead, and unreliable confidence estimation in extractive question answering with large language models under resource-constrained settings, this paper proposes the first learning-based query allocation framework with theoretically optimal latency guarantees. Our method integrates a learning-to-defer mechanism with a theory-driven dynamic scheduling strategy, enabling a multi-expert collaborative inference architecture that adaptively routes incoming queries to specialized submodels. Experiments on SQuADv1/v2 and TriviaQA demonstrate substantial improvements in answer reliability alongside significant reductions in computational cost. Notably, our approach achieves, for the first time, a provably balanced trade-off between accuracy and latency—enabling efficient, scalable, and lightweight deployment of large language models in resource-limited environments.

Technology Category

Application Category

📝 Abstract
Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.
Problem

Research questions and friction points this paper is trying to address.

Optimize query allocation in extractive QA
Improve efficiency in resource-constrained environments
Balance performance and computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning-to-Defer framework
Query allocation strategy
Optimal deferral guarantees
🔎 Similar Papers
No similar papers found.