Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

📅 2026-04-16

🤖 AI Summary

This study investigates whether reinforcement learning (RL) genuinely expands the capability boundary of large language model (LLM) agents on tool-use tasks, or merely makes them more reliable. To separate the two, the authors propose PASS@(k,T), a two-dimensional evaluation metric that jointly varies the sampling budget *k* and the interaction depth *T*. They use it to compare an RL-trained agent, a supervised fine-tuning (SFT) model trained on matched data, and the base model. On compositional, sequential information-gathering tasks, RL extends the capability boundary: its pass curve stays above the base model's and the gap widens at large *k*, so the gain cannot be recovered simply by resampling the base model. On simpler tasks, RL behaves as prior static-reasoning work predicts. SFT, by contrast, regresses the boundary on the same compositional tasks. Mechanism analysis indicates that RL reweights the base model's strategy distribution toward strategies whose downstream reasoning more often yields a correct answer, with the gain concentrated in how the agent integrates retrieved information.
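A PASS@(k,T) grid can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it applies the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for n attempts with c successes, separately at each depth T. It also assumes an attempt counts as solved at depth T if its first correct answer arrived within T interaction rounds, i.e. that truncating a longer run stands in for running the agent with a hard T-round cap; the paper may measure T differently. The function names, the `rounds_to_success` input format, and the toy data are hypothetical.

```python
from math import comb
from typing import Dict, List, Optional, Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts of which c succeeded.

    Equals the probability that at least one of k attempts, drawn without
    replacement from the n recorded ones, is a success.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:  # every k-subset must contain at least one success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_T(
    rounds_to_success: List[List[Optional[int]]],
    ks: Sequence[int],
    Ts: Sequence[int],
) -> Dict[Tuple[int, int], float]:
    """Hypothetical PASS@(k,T) grid, averaged over problems.

    rounds_to_success[i][j] is the interaction round at which attempt j on
    problem i first produced a correct answer, or None if it never did.
    ASSUMPTION: an attempt counts as solved at depth T if that round is <= T.
    """
    grid = {}
    for T in Ts:
        for k in ks:
            scores = []
            for attempts in rounds_to_success:
                n = len(attempts)
                # Successes that fit within the depth budget T.
                c = sum(1 for r in attempts if r is not None and r <= T)
                scores.append(pass_at_k(n, c, k))
            grid[(k, T)] = sum(scores) / len(scores)
    return grid


# Toy data (made up): 2 problems, 4 attempts each.
runs = [
    [1, 3, None, 2],        # problem A: solved in 3 of 4 attempts, at rounds 1, 3, 2
    [None, None, 4, None],  # problem B: solved once, only at round 4
]
grid = pass_at_k_T(runs, ks=[1, 2], Ts=[1, 2, 4])
# Print ordered by T, then k.
for (k, T), score in sorted(grid.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"PASS@({k},{T}) = {score:.3f}")
# PASS@(1,1) = 0.125, PASS@(2,1) = 0.250, PASS@(1,2) = 0.250,
# PASS@(2,2) = 0.417, PASS@(1,4) = 0.500, PASS@(2,4) = 0.750
```

A grid built this way is non-decreasing in both k and T, since more samples or more rounds can only add successes. Problem B illustrates the depth axis: no amount of resampling at T = 1 or T = 2 recovers it, because its only success needs four rounds.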

📝 Abstract

Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.
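The abstract's distinction between capability expansion and efficiency improvement comes down to how the RL-minus-base gap behaves as k grows at a fixed T. A minimal Python sketch of that check follows. The function `boundary_gap`, its endpoint-only trend rule, and the example curves are hypothetical, chosen for illustration rather than taken from the paper.

```python
from typing import Dict, List, Tuple


def boundary_gap(
    base: Dict[int, float], rl: Dict[int, float], tol: float = 1e-9
) -> Tuple[List[float], str]:
    """Compare two pass@k curves measured at the same interaction depth T.

    Returns the RL-minus-base gap at each k (ascending) and a coarse trend:
    'converging' if the gap at the largest k is smaller than at the smallest k
    (the static-reasoning pattern: RL mostly improves efficiency), 'widening'
    if it is larger (the pattern the abstract reports as capability expansion),
    or 'flat' otherwise. Comparing only the two endpoints is a simplification.
    """
    ks = sorted(base)
    if sorted(rl) != ks:
        raise ValueError("both curves must be measured at the same k values")
    gaps = [rl[k] - base[k] for k in ks]
    if gaps[-1] > gaps[0] + tol:
        trend = "widening"
    elif gaps[-1] < gaps[0] - tol:
        trend = "converging"
    else:
        trend = "flat"
    return gaps, trend


# Hypothetical curves at one fixed T (numbers invented for illustration).
base_curve = {1: 0.20, 8: 0.45, 64: 0.60}
rl_curve = {1: 0.35, 8: 0.62, 64: 0.80}
gaps, trend = boundary_gap(base_curve, rl_curve)
print([round(g, 2) for g in gaps], trend)  # [0.15, 0.17, 0.2] widening
```

Curves that approach 1.0 will show a shrinking gap from saturation alone, so the trend label is best read alongside the raw gaps rather than on its own.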

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

large language models

capability boundary

tool use

compositional reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

PASS@(k,T)

reinforcement learning

LLM agents