🤖 AI Summary
This work addresses the challenges of low sampling efficiency and mode collapse in multi-turn reinforcement learning, which arise from sparse or delayed rewards and environmental stochasticity. To mitigate these issues, the authors propose TSR (Trajectory-Search Rollouts), a lightweight tree-based search mechanism that transfers inference-time search strategies into the training phase. By integrating best-of-N sampling, beam search, and shallow lookahead, the method selects high-scoring actions at each turn to construct higher-quality trajectories. The approach is optimizer-agnostic, compatible with existing frameworks such as PPO and GRPO, and preserves the original optimization objective. Evaluated on Sokoban, FrozenLake, and WebShop benchmarks, it achieves up to a 15% performance gain with only a one-time increase in training compute, and significantly improves training stability.
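The per-turn best-of-N idea described above can be sketched in a few lines: at each turn, sample several candidate actions, score them with task-specific feedback, and commit only the best one to the trajectory. The sketch below is illustrative, not the authors' implementation; `ToyEnv`, `random_policy`, and `score_fn` are hypothetical stand-ins for the paper's environments (Sokoban, FrozenLake, WebShop), policy, and feedback signal.

```python
import random

class ToyEnv:
    """Toy 1-D environment: start at 0, reach position +3 within the turn budget."""
    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action is -1 or +1
        self.pos += action
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done

def random_policy(state):
    # Stand-in for sampling an action from the agent's policy.
    return random.choice([-1, 1])

def score_fn(state, action, goal=3):
    # Stand-in for task-specific feedback: prefer actions landing closer to the goal.
    return -abs((state + action) - goal)

def best_of_n_rollout(env, policy, score, n=4, max_turns=10):
    """Generate one trajectory, sampling n candidate actions per turn
    and committing to the highest-scoring one (per-turn best-of-N)."""
    trajectory, state = [], env.reset()
    for _ in range(max_turns):
        candidates = [policy(state) for _ in range(n)]
        best = max(candidates, key=lambda a: score(state, a))
        state, reward, done = env.step(best)
        trajectory.append((best, reward))
        if done:
            break
    return trajectory
```

The resulting trajectories are then fed to the RL optimizer unchanged, which is why the scheme is compatible with PPO- or GRPO-style updates.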
📝 Abstract
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks with a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
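Beyond best-of-N, the abstract's beam instantiation keeps several partial trajectories alive per turn rather than one. A minimal sketch of that variant, under the assumption that the environment can be cloned to branch the search (here via `copy.deepcopy`); `LineEnv`, `beam_rollout`, and the distance-to-goal score are illustrative names, not the paper's API.

```python
import copy

class LineEnv:
    """Toy environment: walk from position 0 to +3 on a number line."""
    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def step(self, action):  # action is -1 or +1
        self.pos += action
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done

def beam_rollout(env, actions=(-1, 1), beam_width=2, max_turns=6):
    """Per-turn beam search: expand every kept partial trajectory with each
    candidate action, then retain the beam_width best under a task-specific
    score (here, distance to the goal; lower is better)."""
    beams = [(env, [])]  # (environment copy, action history)
    for _ in range(max_turns):
        expanded = []
        for e, hist in beams:
            for a in actions:
                e2 = copy.deepcopy(e)  # branch the environment to try this action
                _, _, done = e2.step(a)
                expanded.append((e2, hist + [a], done))
        expanded.sort(key=lambda t: abs(t[0].pos - t[0].goal))
        kept = expanded[:beam_width]
        beams = [(e, h) for e, h, _ in kept]
        if any(d for _, _, d in kept):  # a kept trajectory reached the goal
            break
    return beams[0][1]  # action history of the best surviving beam

# On the toy task, the search commits to three +1 moves:
# beam_rollout(LineEnv()) → [1, 1, 1]
```

Beam search trades more per-turn environment calls for broader exploration than best-of-N; in both cases the extra compute is paid once during training, since inference uses the trained policy directly.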