🤖 AI Summary
To address low query-allocation efficiency, high computational overhead, and unreliable confidence estimation when applying large language models to extractive question answering in resource-constrained settings, this paper proposes the first learning-based query-allocation framework with theoretically optimal latency guarantees. The method combines a learning-to-defer mechanism with a theory-driven dynamic scheduling strategy, yielding a multi-expert collaborative inference architecture that adaptively routes incoming queries to specialized submodels. Experiments on SQuADv1/v2 and TriviaQA show substantial gains in answer reliability alongside significant reductions in computational cost. Notably, the approach achieves, for the first time, a provably balanced trade-off between accuracy and latency, enabling efficient, scalable, and lightweight deployment of large language models in resource-limited environments.
📝 Abstract
Large language models excel at generative tasks but are inefficient at structured text selection, particularly in extractive question answering (EQA). This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral, balancing performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA show that our method improves answer reliability while significantly reducing computational overhead, making it well suited to scalable and efficient EQA deployment.
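The abstract's core idea, routing each query to the expert whose expected confidence best justifies its cost, can be illustrated with a minimal sketch. This is not the paper's actual allocation objective; the scoring rule, function names, and trade-off weight `lam` below are all illustrative assumptions.

```python
# Hypothetical sketch of a confidence/cost deferral rule, NOT the paper's method.
# Each expert has an estimated answer confidence and a relative inference cost;
# we route the query to the expert maximizing (confidence - lam * cost).

def defer(confidences, costs, lam=0.5):
    """Return the index of the expert with the best confidence-cost trade-off."""
    scores = [c - lam * k for c, k in zip(confidences, costs)]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: a cheap small model, a mid-sized model, and an expensive large model.
expert = defer(confidences=[0.62, 0.88, 0.91], costs=[0.1, 0.4, 1.0])
# The mid-sized expert wins: its confidence is nearly as high as the large
# model's, at far lower cost.
```

The weight `lam` plays the role of the performance-versus-cost knob that the paper's theoretical analysis would set in a principled way; here it is simply a fixed hyperparameter.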