🤖 AI Summary
This work addresses the challenges of low sampling efficiency and mode collapse in multi-turn reinforcement learning, which arise from sparse or delayed rewards and environmental stochasticity. To mitigate these issues, the authors propose TSR (Trajectory-Search Rollouts), a lightweight tree-based search mechanism that transfers inference-time search strategies into the training phase. By integrating best-of-N sampling, beam search, and shallow lookahead, the method selects high-scoring actions at each turn to construct higher-quality trajectories. The approach is optimizer-agnostic, compatible with existing frameworks such as PPO and GRPO, and preserves the original optimization objective. Evaluated on Sokoban, FrozenLake, and WebShop benchmarks, it achieves up to a 15% performance gain with only a one-time increase in training compute, and significantly improves training stability.
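The per-turn best-of-N idea described above can be sketched in a few lines: at each turn, sample several candidate actions, score them with task-specific feedback, and commit only the best one to the trajectory. The sketch below is illustrative, not the authors' implementation; `ToyEnv`, `random_policy`, and `score_fn` are hypothetical stand-ins for the paper's environments (Sokoban, FrozenLake, WebShop), policy, and feedback signal.

```python
import random

class ToyEnv:
    """Toy 1-D environment: start at 0, reach position +3 within the turn budget."""
    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action is -1 or +1
        self.pos += action
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done

def random_policy(state):
    # Stand-in for sampling an action from the agent's policy.
    return random.choice([-1, 1])

def score_fn(state, action, goal=3):
    # Stand-in for task-specific feedback: prefer actions landing closer to the goal.
    return -abs((state + action) - goal)

def best_of_n_rollout(env, policy, score, n=4, max_turns=10):
    """Generate one trajectory, sampling n candidate actions per turn
    and committing to the highest-scoring one (per-turn best-of-N)."""
    trajectory, state = [], env.reset()
    for _ in range(max_turns):
        candidates = [policy(state) for _ in range(n)]
        best = max(candidates, key=lambda a: score(state, a))
        state, reward, done = env.step(best)
        trajectory.append((best, reward))
        if done:
            break
    return trajectory
```

The resulting trajectories are then fed to the RL optimizer unchanged, which is why the scheme is compatible with PPO- or GRPO-style updates.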
📝 Abstract
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks with a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
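Beyond best-of-N, the abstract's beam instantiation keeps several partial trajectories alive per turn rather than one. A minimal sketch of that variant, under the assumption that the environment can be cloned to branch the search (here via `copy.deepcopy`); `LineEnv`, `beam_rollout`, and the distance-to-goal score are illustrative names, not the paper's API.

```python
import copy

class LineEnv:
    """Toy environment: walk from position 0 to +3 on a number line."""
    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def step(self, action):  # action is -1 or +1
        self.pos += action
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done

def beam_rollout(env, actions=(-1, 1), beam_width=2, max_turns=6):
    """Per-turn beam search: expand every kept partial trajectory with each
    candidate action, then retain the beam_width best under a task-specific
    score (here, distance to the goal; lower is better)."""
    beams = [(env, [])]  # (environment copy, action history)
    for _ in range(max_turns):
        expanded = []
        for e, hist in beams:
            for a in actions:
                e2 = copy.deepcopy(e)  # branch the environment to try this action
                _, _, done = e2.step(a)
                expanded.append((e2, hist + [a], done))
        expanded.sort(key=lambda t: abs(t[0].pos - t[0].goal))
        kept = expanded[:beam_width]
        beams = [(e, h) for e, h, _ in kept]
        if any(d for _, _, d in kept):  # a kept trajectory reached the goal
            break
    return beams[0][1]  # action history of the best surviving beam

# On the toy task, the search commits to three +1 moves:
# beam_rollout(LineEnv()) → [1, 1, 1]
```

Beam search trades more per-turn environment calls for broader exploration than best-of-N; in both cases the extra compute is paid once during training, since inference uses the trained policy directly.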