🤖 AI Summary
Existing retrievers struggle to provide complementary evidence for multi-step reasoning in reasoning-intensive tasks, and the field lacks evaluation and training methods tailored to agentic search. To address these limitations, this work introduces BRIGHT-Pro, an expert-annotated benchmark that provides multi-aspect gold evidence for each query and evaluates retrievers under both static and agentic search protocols. The authors also construct RTriever-Synth, a synthetic corpus that targets evidence composition through aspect decomposition, generating complementary positives and positive-conditioned hard negatives. RTriever-4B, obtained by LoRA fine-tuning Qwen3-Embedding-4B on this corpus, substantially improves over its base model, and the aspect-aware and agentic evaluations expose retriever behaviors that standard metrics hide.
📝 Abstract
Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.
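To make the contrast between standard and aspect-aware evaluation concrete, the sketch below compares plain recall with a per-aspect coverage score. The abstract does not define BRIGHT-Pro's metrics, so everything here is an assumption for illustration: the function names `recall_at_k` and `aspect_recall_at_k`, the aspect labels, and the document IDs are all hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], aspect_gold: Dict[str, Set[str]], k: int) -> float:
    """Standard recall: fraction of all gold documents found in the top-k."""
    gold: Set[str] = set()
    for docs in aspect_gold.values():
        gold |= docs
    if not gold:
        return 0.0
    return len(set(ranked_ids[:k]) & gold) / len(gold)


def aspect_recall_at_k(ranked_ids: List[str], aspect_gold: Dict[str, Set[str]], k: int) -> float:
    """Assumed aspect-aware variant: fraction of aspects that have
    at least one gold document in the top-k."""
    if not aspect_gold:
        return 0.0
    top_k = set(ranked_ids[:k])
    covered = sum(1 for docs in aspect_gold.values() if top_k & docs)
    return covered / len(aspect_gold)


# Hypothetical query with three evidence aspects (labels are illustrative).
aspect_gold = {
    "mechanism": {"d1", "d2"},
    "counterexample": {"d3"},
    "definition": {"d4"},
}
# A ranking that retrieves two documents from the same aspect first.
ranked = ["d1", "d2", "d9", "d3", "d4"]

standard = recall_at_k(ranked, aspect_gold, k=3)
by_aspect = aspect_recall_at_k(ranked, aspect_gold, k=3)
```

In this toy ranking the top three results hold half of the gold documents but cover only one of the three aspects, which is the kind of gap between topical matching and complementary evidence that the abstract says standard metrics can hide.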