🤖 AI Summary
This study systematically evaluates the practical deployment value of large language model (LLM)-based retrievers on complex queries, jointly assessing efficiency, robustness, confidence reliability, and accuracy, and introducing a reasoning-overhead metric. Building on the BRIGHT benchmark, the evaluation spans 12 tasks and 14 retrievers, analyzing cold-start costs, latency distributions, throughput, corpus scalability, perturbation robustness, and confidence calibration. The experiments reveal that specialized reasoning-based retrievers achieve both high throughput and strong effectiveness; large bi-encoders incur high latency for limited gains; reasoning augmentation yields significant improvements at low overhead for small models (<1B parameters) but diminishing or even detrimental returns for state-of-the-art models and for tasks involving mathematics or code; and confidence miscalibration remains a widespread issue across all model types.
📝 Abstract
Large language model (LLM) retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce a reasoning-intensive retrieval benchmark (BRIGHT) across 12 tasks and 14 retrievers, and extend the evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and the use of confidence scores (AUROC) for predicting query success. We also quantify *reasoning overhead* by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning augmentation adds minimal latency for sub-1B encoders but shows diminishing returns for top retrievers and can reduce performance on formal math/code domains. Confidence calibration is consistently weak across model families, indicating that raw retrieval scores are unreliable for downstream routing without additional calibration. We release all code and artifacts for reproducibility.
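The two measurement ideas in the abstract can be sketched in a few lines: confidence reliability as the AUROC of raw retrieval scores against per-query success labels, and reasoning overhead as accuracy gain per unit of added latency. This is a minimal illustrative sketch with hypothetical data; the paper's exact metric definitions may differ.

```python
def auroc(scores, labels):
    """AUROC of confidence scores for predicting binary query success.

    Uses the rank-sum (Mann-Whitney U) formulation; tied scores
    receive the average rank of their tied block.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos_ranks), len(labels) - len(pos_ranks)
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def reasoning_overhead(acc_base, acc_reason, lat_base, lat_reason):
    """Accuracy gain per second of added latency (hypothetical definition)."""
    return (acc_reason - acc_base) / (lat_reason - lat_base)


# Hypothetical example: well-ordered scores give AUROC = 1.0.
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))   # -> 1.0
# Hypothetical example: +0.05 nDCG for +2.0 s of latency.
print(reasoning_overhead(0.30, 0.35, 1.0, 3.0))     # -> 0.025 per second
```

An AUROC near 0.5 would indicate that retrieval scores carry little signal for routing queries, which is the miscalibration the abstract reports.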