TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-turn retrieval-augmented reasoning methods often suffer from process homogenization and intra-group homogenization because they rely solely on sparse final-answer rewards. To address this, this work proposes TSPO, which introduces a First-Occurrence Latent Reward (FOLR) mechanism that requires neither external reward models nor human annotations. By assigning partial rewards at the step where the ground-truth answer first appears, TSPO preserves process-level signals and increases intra-group reward variance. Combined with a reinforcement learning-driven multi-turn search strategy, stage-aware policy gradients, and an improved intra-group advantage estimator, TSPO achieves average performance gains of 24% and 13.6% on Qwen2.5-3B and Qwen2.5-7B, respectively, significantly outperforming current state-of-the-art approaches.

📝 Abstract
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards lead to inefficient intra-group advantage estimation in methods such as Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.
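The FOLR idea described in the abstract can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the partial-reward coefficient `alpha`, and the substring check for the ground-truth answer are all assumptions.

```python
# Hedged sketch of First-Occurrence Latent Reward (FOLR) shaping plus a
# GRPO-style intra-group advantage. `alpha` and the substring match are
# illustrative assumptions, not details from the paper.

def folr_rewards(turn_outputs, ground_truth, outcome_reward, alpha=0.5):
    """Per-turn rewards for one multi-turn rollout: the sparse outcome
    reward goes to the final turn, and a partial reward alpha * outcome
    goes to the earliest turn whose text already contains the answer."""
    rewards = [0.0] * len(turn_outputs)
    rewards[-1] = outcome_reward
    for t, text in enumerate(turn_outputs[:-1]):
        if ground_truth in text:  # first occurrence of the answer
            rewards[t] += alpha * outcome_reward
            break
    return rewards

def group_relative_advantages(group_rewards):
    """GRPO-style estimator: standardize each rollout's total reward
    against the mean/std of its sampled group."""
    totals = [sum(r) for r in group_rewards]
    mean = sum(totals) / len(totals)
    var = sum((x - mean) ** 2 for x in totals) / len(totals)
    std = max(var ** 0.5, 1e-8)
    return [(x - mean) / std for x in totals]
```

The intended effect is visible even in this toy form: two rollouts with the same final answer but different first-occurrence turns receive different total rewards, so the group's reward variance (and hence the GRPO advantage signal) no longer collapses to zero.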
Problem

Research questions and friction points this paper is trying to address.

Double Homogenization Dilemma
Multi-turn Search Policy Optimization
Process homogenization
Intra-group homogenization
Sparse reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

TSPO
FOLR
multi-turn reasoning
reward shaping
policy optimization
Shichao Ma
University of Science and Technology of China; Tiansuan Lab, Ant Group Co., Ltd.
Zhiyuan Ma
Tiansuan Lab, Ant Group Co., Ltd.
Ming Yang
Tiansuan Lab, Ant Group Co., Ltd.; Fudan University
Xiaofan Li
East China Normal University
Xing Wu
Tiansuan Lab, Ant Group Co., Ltd.
Jintao Du
Tiansuan Lab, Ant Group Co., Ltd.
Yu Cheng
Tiansuan Lab, Ant Group Co., Ltd.
Weiqiang Wang
Ant Group
Qiliang Liu
Central South University
Zhen-Qiang Zhou
University of Science and Technology of China
Yang Wang
University of Science and Technology of China