🤖 AI Summary
This work addresses a challenge in agentic reinforcement learning: parameter-constrained large language models (LLMs) often generate noisy trajectories due to execution failures, so outcome-based rewards erroneously reinforce incorrect behaviors and exacerbate the credit-assignment problem. To mitigate this, the authors propose CLEANER, whose core Similarity-Aware Adaptive Rollback (SAAR) mechanism leverages the model's intrinsic self-correction capability. During data collection, SAAR adaptively repairs trajectories based on semantic similarity, ranging from shallow execution fixes to deep reasoning replacements, thereby purifying trajectories without external filtering or costly oversampling. Evaluated on AIME24/25, GPQA, and LiveCodeBench, the approach improves average accuracy by 6%, 3%, and 5%, respectively, and matches state-of-the-art performance using only one-third of the typical training steps.
📝 Abstract
Agentic Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to utilize tools like Python interpreters for complex problem-solving. However, for parameter-constrained models (e.g., 4B–7B), the exploration phase is often plagued by frequent execution failures, creating noisy trajectories that hinder policy optimization. Under standard outcome-based reward settings, this noise leads to a critical credit assignment issue, where erroneous actions are inadvertently reinforced alongside successful outcomes. Existing mitigations face a dilemma: dense rewards often trigger reward hacking, while supersampling incurs prohibitive computational costs. To address these challenges, we propose CLEANER. Distinct from external filtering methods, CLEANER exploits the model's intrinsic self-correction capabilities to eliminate error-contaminated context directly during data collection. At its core, the Similarity-Aware Adaptive Rollback (SAAR) mechanism autonomously constructs clean, purified trajectories by retrospectively replacing failures with successful self-corrections. Based on semantic similarity, SAAR adaptively regulates replacement granularity from shallow execution repairs to deep reasoning substitutions. By training on these self-purified paths, the model internalizes correct reasoning patterns rather than error-recovery loops. Empirical results on AIME24/25, GPQA, and LiveCodeBench show average accuracy gains of 6%, 3%, and 5% over baselines. Notably, CLEANER matches state-of-the-art performance using only one-third of the training steps, highlighting trajectory purification as a scalable solution for efficient agentic RL. Our models and code are available on GitHub.
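The abstract does not give SAAR's exact algorithm, but the described behavior can be sketched: when a failed step is later followed by a successful self-correction, high semantic similarity between the two reasoning segments suggests a shallow execution slip (keep the reasoning, swap the action), while low similarity suggests the model re-derived its approach (replace the whole step). The following is a minimal, hypothetical sketch under those assumptions; the names `Step` and `rollback_repair` are illustrative, and a bag-of-words cosine stands in for a real semantic-embedding similarity.

```python
import math
from collections import Counter
from dataclasses import dataclass


def cosine_similarity(a: str, b: str) -> float:
    # Bag-of-words cosine: a cheap stand-in for a semantic embedding model.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Step:
    reasoning: str  # chain-of-thought preceding the tool call
    code: str       # tool / interpreter action
    ok: bool        # did execution succeed?


def rollback_repair(trajectory: list[Step], threshold: float = 0.7) -> list[Step]:
    """Retrospectively replace failed steps with their later self-corrections.

    Similarity >= threshold -> shallow repair (original reasoning, fixed code);
    similarity <  threshold -> deep repair (substitute the corrected step).
    Intermediate failed retries are dropped, yielding a purified trajectory.
    """
    purified: list[Step] = []
    i = 0
    while i < len(trajectory):
        step = trajectory[i]
        if step.ok:
            purified.append(step)
            i += 1
            continue
        # Scan forward for the first successful retry of this step.
        j = i + 1
        while j < len(trajectory) and not trajectory[j].ok:
            j += 1
        if j == len(trajectory):  # no successful correction found; keep as-is
            purified.append(step)
            break
        fix = trajectory[j]
        if cosine_similarity(step.reasoning, fix.reasoning) >= threshold:
            # Shallow execution repair: the plan was right, the code was not.
            purified.append(Step(step.reasoning, fix.code, True))
        else:
            # Deep reasoning substitution: the model changed its approach.
            purified.append(fix)
        i = j + 1  # skip the failed attempts that were rolled back
    return purified
```

Training on the purified list rather than the raw one is what, per the abstract, lets the model internalize correct reasoning patterns instead of error-recovery loops.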