🤖 AI Summary
This work addresses a challenge in agentic reinforcement learning: parameter-constrained large language models (LLMs) often generate noisy trajectories due to execution failures, so outcome-based rewards erroneously reinforce incorrect behaviors and exacerbate the credit-assignment problem. To mitigate this, the authors propose CLEANER, whose core Similarity-Aware Adaptive Rollback (SAAR) mechanism leverages the model's intrinsic self-correction capability. During data collection, SAAR adaptively repairs trajectories based on semantic similarity, ranging from shallow execution fixes to deep reasoning replacements, thereby purifying trajectories without external filtering or costly oversampling. Evaluated on AIME24/25, GPQA, and LiveCodeBench, the approach improves average accuracy by 6%, 3%, and 5%, respectively, and matches state-of-the-art performance using only one-third of the typical training steps.
📝 Abstract
Agentic Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to utilize tools like Python interpreters for complex problem-solving. However, for parameter-constrained models (e.g., 4B–7B), the exploration phase is often plagued by frequent execution failures, creating noisy trajectories that hinder policy optimization. Under standard outcome-based reward settings, this noise leads to a critical credit assignment issue, where erroneous actions are inadvertently reinforced alongside successful outcomes. Existing mitigations face a dilemma: dense rewards often trigger reward hacking, while supersampling incurs prohibitive computational costs. To address these challenges, we propose CLEANER. Distinct from external filtering methods, CLEANER exploits the model's intrinsic self-correction capabilities to eliminate error-contaminated context directly during data collection. At its core, the Similarity-Aware Adaptive Rollback (SAAR) mechanism autonomously constructs clean, purified trajectories by retrospectively replacing failures with successful self-corrections. Based on semantic similarity, SAAR adaptively regulates replacement granularity from shallow execution repairs to deep reasoning substitutions. By training on these self-purified paths, the model internalizes correct reasoning patterns rather than error-recovery loops. Empirical results on AIME24/25, GPQA, and LiveCodeBench show average accuracy gains of 6%, 3%, and 5% over baselines. Notably, CLEANER matches state-of-the-art performance using only one-third of the training steps, highlighting trajectory purification as a scalable solution for efficient agentic RL. Our models and code are available on GitHub.
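The abstract does not give SAAR's exact algorithm, but the described behavior can be sketched: when a failed step is later followed by a successful self-correction, high semantic similarity between the two reasoning segments suggests a shallow execution slip (keep the reasoning, swap the action), while low similarity suggests the model re-derived its approach (replace the whole step). The following is a minimal, hypothetical sketch under those assumptions; the names `Step` and `rollback_repair` are illustrative, and a bag-of-words cosine stands in for a real semantic-embedding similarity.

```python
import math
from collections import Counter
from dataclasses import dataclass


def cosine_similarity(a: str, b: str) -> float:
    # Bag-of-words cosine: a cheap stand-in for a semantic embedding model.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Step:
    reasoning: str  # chain-of-thought preceding the tool call
    code: str       # tool / interpreter action
    ok: bool        # did execution succeed?


def rollback_repair(trajectory: list[Step], threshold: float = 0.7) -> list[Step]:
    """Retrospectively replace failed steps with their later self-corrections.

    Similarity >= threshold -> shallow repair (original reasoning, fixed code);
    similarity <  threshold -> deep repair (substitute the corrected step).
    Intermediate failed retries are dropped, yielding a purified trajectory.
    """
    purified: list[Step] = []
    i = 0
    while i < len(trajectory):
        step = trajectory[i]
        if step.ok:
            purified.append(step)
            i += 1
            continue
        # Scan forward for the first successful retry of this step.
        j = i + 1
        while j < len(trajectory) and not trajectory[j].ok:
            j += 1
        if j == len(trajectory):  # no successful correction found; keep as-is
            purified.append(step)
            break
        fix = trajectory[j]
        if cosine_similarity(step.reasoning, fix.reasoning) >= threshold:
            # Shallow execution repair: the plan was right, the code was not.
            purified.append(Step(step.reasoning, fix.code, True))
        else:
            # Deep reasoning substitution: the model changed its approach.
            purified.append(fix)
        i = j + 1  # skip the failed attempts that were rolled back
    return purified
```

Training on the purified list rather than the raw one is what, per the abstract, lets the model internalize correct reasoning patterns instead of error-recovery loops.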