AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Search-augmented large language models (LLMs) struggle on complex reasoning tasks because their multi-hop retrieval is ineffective and their reasoning ability is limited. Method: AceSearcher is a cooperative self-play framework in which a single LLM alternates between two roles: a decomposer that breaks down complex queries, and a solver that integrates the retrieved contexts to generate the answer. Training couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, so no intermediate annotations are required. Contribution/Results: On three reasoning-intensive tasks spanning 10 datasets, AceSearcher outperforms state-of-the-art baselines with an average exact match improvement of 7.6%. On document-level finance reasoning, AceSearcher-32B matches DeepSeek-V3 while using less than 5% of its parameters. At smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9x more parameters.

📝 Abstract
Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks. Our code will be published at https://github.com/ritaranx/AceSearcher and https://huggingface.co/AceSearcher.
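To make the two-role design concrete, here is a minimal sketch of how one model could be called as both decomposer and solver at inference time. It is an illustration under stated assumptions, not the authors' implementation: the `[DECOMPOSE]` and `[SOLVE]` prompt tags, the one-sub-question-per-line output format, the function names, and the toy stubs are all hypothetical.

```python
from typing import Callable, List

# Hypothetical interfaces; none of these names come from the paper.
LLM = Callable[[str], str]              # prompt -> completion
Retriever = Callable[[str], List[str]]  # query -> retrieved passages


def decompose(llm: LLM, question: str) -> List[str]:
    """Decomposer role: assume the model returns one sub-question per line."""
    out = llm(f"[DECOMPOSE]\n{question}")
    return [line.strip() for line in out.splitlines() if line.strip()]


def solve(llm: LLM, question: str, passages: List[str], notes: List[str]) -> str:
    """Solver role: answer from retrieved passages plus earlier sub-answers."""
    context = "\n".join(passages + notes)
    return llm(f"[SOLVE]\n{context}\nQ: {question}").strip()


def answer(llm: LLM, retrieve: Retriever, question: str) -> str:
    """Run decomposition, per-hop retrieval, and a final solve with one model."""
    notes: List[str] = []
    for sub_q in decompose(llm, question):
        sub_a = solve(llm, sub_q, retrieve(sub_q), notes)
        notes.append(f"{sub_q} -> {sub_a}")
    # Final pass: same model, solver role, conditioned on the sub-answers.
    return solve(llm, question, retrieve(question), notes)


# Toy stand-ins so the sketch runs without a real model or search index.
def toy_llm(prompt: str) -> str:
    if prompt.startswith("[DECOMPOSE]"):
        return "Who directed the film?\nWhere was that director born?"
    return "stub answer"


def toy_retrieve(query: str) -> List[str]:
    return [f"(passage about: {query})"]
```

Passing the same `llm` callable to both roles mirrors the single-model design described in the abstract; the roles differ only in the prompt.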
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-hop retrieval for complex reasoning tasks
Improving reasoning ability without intermediate annotations
Boosting efficiency of search-augmented LLMs with fewer parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

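Cooperative self-play trains a single LLM to alternate between decomposer and solver roles
Combines supervised and reinforcement fine-tuning without intermediate annotations
Matches or surpasses much larger models: AceSearcher-32B matches DeepSeek-V3 on document-level finance reasoning with less than 5% of its parameters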
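
The results are reported in exact match, and reinforcement fine-tuning is described as optimizing final answer accuracy. As a rough illustration of what such a score can look like, the sketch below implements a commonly used normalized exact match (lowercasing, stripping punctuation and English articles, collapsing whitespace). The normalization rules and function names are assumptions, not taken from the paper, and the paper's actual reward may differ.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))
```

Because the score depends only on the final answer, it needs no labels for intermediate steps such as sub-questions or retrieved passages.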