PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

📅 2026-03-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the trade-off between efficiency and generalization in long-horizon agent tasks, where supervised fine-tuning (SFT) is sample-efficient but suffers from poor generalization, while end-to-end reinforcement learning (E2E RL) generalizes well but incurs high computational costs. To bridge this gap, the authors propose PivotRL, a framework that performs localized on-policy rollouts over SFT trajectories to identify highly informative "pivot" intermediate steps. PivotRL introduces a functionally equivalent action reward mechanism that preserves the relative ordering of action probabilities while delivering a strong learning signal. This approach combines the efficiency of SFT with the generalization strength of E2E RL, achieving an average 4.17% improvement in in-domain accuracy across four agent tasks and a 10.04% gain in out-of-domain accuracy on non-agent tasks. On code generation benchmarks, PivotRL matches E2E RL performance using only one-quarter of the rollout iterations and has been deployed in NVIDIA's Nemotron-3-Super-120B-A12B production system.
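The functionally-equivalent reward idea described above can be sketched in a few lines: instead of rewarding only an exact string match against the SFT demonstration, a sampled action also earns reward if executing it produces the same observable outcome as the demonstration. This is an illustrative toy, not the paper's implementation; `functional_equivalence_reward` and the `execute` callback are assumed names, and real agent environments would need sandboxed execution rather than a simple callable.

```python
def functional_equivalence_reward(sampled_action, demo_action, execute):
    """Toy sketch of a functionally-equivalent action reward.

    Returns 1.0 when the sampled action has the same effect as the
    demonstration action, even if the action strings differ; 0.0
    otherwise. `execute` maps an action string to an observable
    outcome (here a plain callable; a real system would use a
    sandboxed environment).
    """
    if sampled_action == demo_action:
        return 1.0  # exact string match is trivially equivalent
    # Different strings can still be functionally equivalent if they
    # produce the same outcome when executed.
    return 1.0 if execute(sampled_action) == execute(demo_action) else 0.0


# Usage with a toy "environment" that evaluates arithmetic expressions:
# "2+3" and "3+2" differ as strings but are functionally equivalent.
execute = lambda action: eval(action)  # toy executor; never eval untrusted input
```

Rewarding equivalence rather than exact matches avoids penalizing valid actions that merely differ in surface form, which is one way the summary's claim about preserving action-probability ordering can be read.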

πŸ“ Abstract
Post-training for long-horizon agentic tasks faces a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots, informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstration. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient norm while maximally preserving policy probability ordering on actions unrelated to training tasks. Compared to standard SFT on identical data, we demonstrate that PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains, and +10.04% higher OOD accuracy on non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves accuracy competitive with E2E RL using 4x fewer rollout turns. PivotRL is adopted by NVIDIA's Nemotron-3-Super-120B-A12B, acting as the workhorse in production-scale agentic post-training.
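The pivot-filtering mechanism in the abstract can be sketched as follows: branch K short on-policy rollouts from each intermediate turn of an SFT trajectory, score each rollout as success or failure, and keep the turns where outcomes vary widely. This is a hedged illustration under assumed inputs; `find_pivots`, the binary outcome encoding, and the `variance_threshold` value are illustrative choices, not details from the paper.

```python
import statistics

def find_pivots(turn_outcomes, variance_threshold=0.2):
    """Identify "pivot" turns from local on-policy rollout outcomes.

    turn_outcomes: list over intermediate turns; each entry holds the
    binary success signals (0/1) of K rollouts branched at that turn.
    Returns the indices of turns whose outcome variance meets the
    threshold -- turns where the sampled action strongly influences
    task success, hence an informative learning signal.
    """
    pivots = []
    for turn_idx, outcomes in enumerate(turn_outcomes):
        # pvariance of 0/1 outcomes is p*(1-p); it peaks at 0.25 when
        # rollouts split evenly between success and failure.
        if len(outcomes) > 1 and statistics.pvariance(outcomes) >= variance_threshold:
            pivots.append(turn_idx)
    return pivots
```

A turn where every branched rollout succeeds (or every one fails) carries little gradient information, so filtering to high-variance turns is one plausible reading of how PivotRL concentrates compute on informative steps.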
Problem

Research questions and friction points this paper is trying to address.

post-training
out-of-domain generalization
compute efficiency
agentic tasks
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

PivotRL
agentic post-training
out-of-domain generalization
on-policy rollout
functional-equivalent rewards