Online Finetuning Decision Transformers with Pure RL Gradients

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing online fine-tuning methods for Decision Transformers (DTs), which rely on supervised learning objectives and struggle to leverage pure reinforcement learning (RL) gradients, in part because hindsight return relabeling is incompatible with importance-sampling-based RL. To overcome these challenges, we propose the first purely RL-gradient-driven online fine-tuning framework for DTs. By extending the GRPO algorithm with sub-trajectory optimization, a sequence-level likelihood objective, and an active sampling mechanism, our approach improves credit assignment, training stability, and exploration efficiency. Extensive experiments demonstrate that the proposed method outperforms current online DT approaches across multiple benchmarks, achieving new state-of-the-art results and validating pure RL-based online fine-tuning for Decision Transformers.
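The group-relative advantage at the core of GRPO, which the paper adapts to DTs, can be sketched as follows. This is a minimal illustration of the standard GRPO normalization (no learned critic), not the paper's implementation; the function name is hypothetical.

```python
import numpy as np

def grpo_advantages(group_returns, eps=1e-8):
    """Group-relative advantages: normalize each rollout's return
    against the mean/std of its own sampled group, so no value
    network (critic) is needed."""
    r = np.asarray(group_returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of 4 rollouts sampled from the same DT policy.
adv = grpo_advantages([1.0, 3.0, 2.0, 2.0])
# Advantages are zero-mean within the group; the best rollout
# gets a positive signal, the worst a negative one.
```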

📝 Abstract
Decision Transformers (DTs) have emerged as a powerful framework for sequential decision making by formulating offline reinforcement learning (RL) as a sequence modeling problem. However, extending DTs to online settings with pure RL gradients remains largely unexplored, as existing approaches continue to rely heavily on supervised sequence-modeling objectives during online finetuning. We identify hindsight return relabeling -- a standard component in online DTs -- as a critical obstacle to RL-based finetuning: while beneficial for supervised learning, it is fundamentally incompatible with importance sampling-based RL algorithms such as GRPO, leading to unstable training. Building on this insight, we propose new algorithms that enable online finetuning of Decision Transformers using pure reinforcement learning gradients. We adapt GRPO to DTs and introduce several key modifications, including sub-trajectory optimization for improved credit assignment, sequence-level likelihood objectives for enhanced stability and efficiency, and active sampling to encourage exploration in uncertain regions. Through extensive experiments, we demonstrate that our methods outperform existing online DT baselines and achieve new state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of pure-RL-based online finetuning for Decision Transformers.
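The sequence-level likelihood objective described above can be sketched as a clipped surrogate with a single importance ratio per sub-trajectory, rather than one ratio per action token. This is a hedged sketch under standard PPO/GRPO conventions; the paper's exact objective may differ, and the function name is hypothetical.

```python
import math

def seq_level_grpo_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Sequence-level clipped surrogate for one sub-trajectory.

    logp_new / logp_old: per-action log-probabilities of the same
    sub-trajectory under the current and behavior (sampling) policy.
    The importance ratio is computed once over the whole sequence by
    summing log-probs, which avoids the high variance of per-token
    ratios in long action sequences.
    """
    ratio = math.exp(sum(logp_new) - sum(logp_old))
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Pessimistic (clipped) surrogate, negated for gradient descent.
    return -min(ratio * advantage, clipped * advantage)
```

With identical policies the ratio is 1 and the loss reduces to the negated advantage; when the ratio drifts outside `[1 - eps, 1 + eps]`, clipping caps the update, which is the stability property the abstract attributes to the sequence-level objective.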
Problem

Research questions and friction points this paper is trying to address.

Decision Transformers
online finetuning
pure RL gradients
hindsight return relabeling
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decision Transformers
online finetuning
pure RL gradients
GRPO
hindsight relabeling