SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

📅 2025-09-02
🤖 AI Summary
Reinforcement learning (RL) for multi-turn tool-integrated reasoning (TIR) suffers from distributional drift induced by noisy or uninformative external tool feedback, leading to training instability and catastrophic performance collapse. Method: We propose SimpleTIR, a plug-and-play RL algorithm that filters out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. Removing these trajectories from the policy update blocks the harmful, high-magnitude gradients behind catastrophic gradient norm explosions and stabilizes policy-gradient optimization, without requiring additional supervision. Contribution/Results: SimpleTIR enables the end-to-end emergence of sophisticated reasoning behaviors, including self-correction and cross-validation. On mathematical reasoning it achieves state-of-the-art performance, raising the AIME24 score from a text-only baseline of 22.1 to 50.5 with the Qwen2.5-7B base model. The results empirically validate that alleviating distributional drift through trajectory filtering is critical for robust multi-turn TIR training.

📝 Abstract
Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
Problem

Research questions and friction points this paper is trying to address.

Stabilizing multi-turn tool-integrated reinforcement learning training
Addressing distributional drift from external tool feedback
Preventing catastrophic gradient norm explosions during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play algorithm stabilizes multi-turn TIR training
Filters void turns to block harmful gradient explosions
Enables diverse reasoning patterns without supervised fine-tuning
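The filtering idea above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the turn markers (a fenced code block or a `\boxed{}` final answer) and the trajectory representation are assumptions chosen for clarity.

```python
import re

# A turn is "void" if it yields neither a code block nor a final answer.
# Both detection patterns are illustrative assumptions.
CODE_BLOCK = re.compile(r"```.*?```", re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{.*?\}")

def is_void_turn(turn_text: str) -> bool:
    """Return True if the turn produced no code block and no final answer."""
    return not (CODE_BLOCK.search(turn_text) or FINAL_ANSWER.search(turn_text))

def filter_trajectories(trajectories: list[list[str]]) -> list[list[str]]:
    """Keep only trajectories with no void turns for the policy update."""
    return [traj for traj in trajectories
            if not any(is_void_turn(turn) for turn in traj)]

# Example batch: the second trajectory contains a void turn
# ("Let me think..." has neither code nor an answer), so the
# whole trajectory is dropped before the gradient step.
batch = [
    ["Compute 2+2.\n```python\nprint(2+2)\n```", "The answer is \\boxed{4}."],
    ["Let me think...", "The answer is \\boxed{7}."],
]
kept = filter_trajectories(batch)
```

Note that filtering operates on whole trajectories, not individual turns: because the distributional drift compounds across turns, a single void turn is treated as evidence that the entire trajectory would contribute destabilizing gradients.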