🤖 AI Summary
This study investigates whether reinforcement learning (RL) genuinely expands the capability boundary of large language model (LLM) agents in tool-use tasks, or merely makes them more reliable. To this end, the authors propose PASS@(k,T), a two-dimensional evaluation metric that jointly varies sampling budget *k* and interaction depth *T*, and compare RL-trained, supervised fine-tuned, and base models. Experiments show that RL enlarges the capability boundary on compositional, sequential information-gathering tasks: the gap over the base model widens at large *k* instead of closing, so the gain cannot be recovered by resampling. On simpler tasks, RL behaves as prior work on static reasoning predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks. Mechanism analysis further indicates that RL reweights the base model's strategy distribution toward strategies whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information.
📝 Abstract
Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.
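The abstract does not spell out how PASS@(k,T) is computed, so the following is only a minimal sketch under stated assumptions: for each interaction depth T, every task is attempted with n independent rollouts capped at T rounds, and the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), is averaged over tasks. The function names `pass_at_k` and `pass_grid`, the input layout, and the toy counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n rollouts of which c succeeded."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    # Probability that a random k-subset of the n rollouts has no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_grid(successes_by_depth, n, ks):
    """Assumed layout: successes_by_depth[T] lists, per task, how many of
    the n rollouts succeeded when interaction was capped at T rounds.
    Returns {(k, T): mean pass@k over tasks}."""
    grid = {}
    for T, counts in successes_by_depth.items():
        for k in ks:
            grid[(k, T)] = sum(pass_at_k(n, c, k) for c in counts) / len(counts)
    return grid


# Toy counts for illustration only; these are not results from the paper.
toy_counts = {1: [0, 1], 3: [2, 4]}  # two tasks, depths T=1 and T=3
toy_grid = pass_grid(toy_counts, n=4, ks=[1, 2])
```

Under this reading, comparing two models means comparing their grids cell by cell: a gap that persists or widens as k grows at fixed T would suggest capability expansion, while a gap that closes at large k would suggest an efficiency gain only.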