Natural Language Query to Configuration for Retrieval Agents

📅 2026-05-26

🤖 AI Summary

This work addresses the challenge of balancing answer accuracy and serving cost for retrieval agents across diverse natural-language queries. It proposes BRANE, a framework for dynamic, per-query configuration of the entire retrieval pipeline. BRANE uses a large language model to extract workload-specific query features, trains a lightweight per-configuration predictor of answer correctness, and at inference time selects from a predefined catalog the pipeline that maximizes predicted correctness penalized by cost, so the cost-quality tradeoff can be tuned without retraining. On MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE pushes the cost-quality Pareto frontier, matching the accuracy of the best fixed configuration at up to 89% lower cost and outperforming LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines.

📝 Abstract

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
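The inference-time selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that predicted correctness is "penalized by cost", so the linear penalty `p_correct - lam * cost` is an assumption, as are the names `PipelineConfig`, `select_config`, and `lam`, the single hop-count feature, and the toy predictors and costs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class PipelineConfig:
    """One entry in a hypothetical pipeline catalog."""
    name: str
    cost: float  # assumed per-query serving cost, arbitrary units


# A predictor maps query features to an estimated probability of a correct answer.
Predictor = Callable[[Sequence[float]], float]


def select_config(
    features: Sequence[float],
    catalog: Sequence[PipelineConfig],
    predictors: Dict[str, Predictor],
    lam: float,
) -> PipelineConfig:
    """Return the config maximizing predicted correctness minus lam * cost.

    The linear penalty is an assumed form; lam = 0 ignores cost entirely,
    and larger values favor cheaper pipelines. Changing lam needs no retraining.
    """
    def score(cfg: PipelineConfig) -> float:
        return predictors[cfg.name](features) - lam * cfg.cost

    return max(catalog, key=score)


# Toy catalog and hand-written stand-ins for trained per-configuration predictors.
# The single feature is an assumed LLM-extracted estimate of required hops.
CATALOG = [PipelineConfig("cheap", cost=1.0), PipelineConfig("strong", cost=10.0)]
PREDICTORS: Dict[str, Predictor] = {
    "cheap": lambda f: 0.9 if f[0] <= 1 else 0.3,
    "strong": lambda f: 0.95,
}

single_hop = [1.0]
multi_hop = [3.0]

print(select_config(single_hop, CATALOG, PREDICTORS, lam=0.01).name)
print(select_config(multi_hop, CATALOG, PREDICTORS, lam=0.01).name)
```

With these toy numbers, a small `lam` sends the single-hop query to the cheap pipeline and the multi-hop query to the strong one. Raising `lam` shifts more queries to the cheap pipeline, which is the tunable tradeoff the abstract describes.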

Problem

Research questions and friction points this paper is trying to address.

retrieval agents

natural language query

configuration selection

cost-quality tradeoff

inference-time optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

per-query configuration

retrieval agents

cost-quality tradeoff