🤖 AI Summary
Current neural retrieval models exhibit limited reasoning capabilities, while large language models (LLMs) incur prohibitive computational costs; moreover, query rewriting approaches struggle to support the iterative exploration and dynamic revision that complex queries require. This paper proposes Orion, a framework that integrates reinforcement learning (RL) with synthetic trajectory training to enable a lightweight (1.2B-parameter) language model to autonomously perform multi-step reasoning, self-reflection, and dynamic query optimization during retrieval. Orion unifies synthetic trajectory generation, supervised fine-tuning, RL-based policy optimization, and inference-time beam search, enabling end-to-end learning of retrieval strategies. Experiments show that Orion outperforms state-of-the-art retrievers with 200-400x more parameters on five of six mainstream benchmarks, reaching 77.6% on SciFact, 25.2% on BRIGHT, and 63.2% on NFCorpus, thereby challenging the "scale-only" paradigm in neural retrieval.
📝 Abstract
Effective information retrieval requires reasoning over partial evidence and refining strategies as information emerges. Yet current approaches fall short: neural retrievers lack reasoning capabilities, large language models (LLMs) provide semantic depth but at prohibitive cost, and query rewriting or decomposition limits improvement to static, one-shot transformations. As a result, existing methods fail to capture the iterative dynamics of exploration, feedback, and revision that complex user queries demand. We introduce Orion, a training framework that enables compact models (350M-1.2B parameters) to perform iterative retrieval through learned search strategies. Orion combines: (1) synthetic trajectory generation and supervised fine-tuning to encourage diverse exploration patterns, (2) reinforcement learning (RL) that rewards effective query refinement and backtracking behaviors, and (3) inference-time beam search that exploits the self-reflection capabilities learned during RL. Despite using only 3% of the available training data, our 1.2B model achieves 77.6% success on SciFact (vs. 72.6% for prior retrievers), 25.2% on BRIGHT (vs. 22.1%), 63.2% on NFCorpus (vs. 57.8%), and remains competitive on FEVER, HotpotQA, and MS MARCO. It outperforms retrievers up to 200-400x larger on five of six benchmarks. These findings suggest that retrieval performance can emerge from learned strategies, not just model scale, when models are trained to search, reflect, and revise.
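The inference-time component in (3) can be sketched as a beam search over query refinements: keep the top-scoring candidate queries at each step, expand each with proposed rewrites, and stop when no refinement improves retrieval. This is a minimal illustration, not the paper's implementation; `propose` and `score` are hypothetical stand-ins for the learned policy model and the retrieval success signal.

```python
from typing import Callable, List, Tuple

def beam_search_retrieval(
    query: str,
    propose: Callable[[str], List[str]],  # stand-in for the policy's query refinements
    score: Callable[[str], float],        # stand-in for a retrieval success proxy
    beam_width: int = 3,
    max_steps: int = 4,
) -> Tuple[str, float]:
    """Keep the top-`beam_width` refinements per step; halt when nothing improves."""
    beam = [(score(query), query)]
    for _ in range(max_steps):
        candidates = list(beam)  # keeping the beam allows implicit backtracking
        for _, q in beam:
            for refined in propose(q):
                candidates.append((score(refined), refined))
        candidates.sort(key=lambda pair: -pair[0])
        new_beam = candidates[:beam_width]
        if new_beam[0][0] <= beam[0][0]:  # no candidate beats the current best: stop
            break
        beam = new_beam
    return beam[0][1], beam[0][0]

# Toy stand-ins: score = overlap with the vocabulary of a hypothetical target document.
TARGET = {"aspirin", "reduces", "cardiac", "risk"}

def toy_score(q: str) -> float:
    return len(set(q.lower().split()) & TARGET) / len(TARGET)

def toy_propose(q: str) -> List[str]:
    return [q + " cardiac", q + " risk", q.replace("heart", "cardiac")]

best_q, best_s = beam_search_retrieval("aspirin reduces heart attacks", toy_propose, toy_score)
```

In the toy run the search rewrites "aspirin reduces heart attacks" until the query covers the target vocabulary, mirroring how learned refinement plus beam search can recover terms the initial query misses.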