Return-Aligned Decision Transformer

📅 2024-02-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak controllability of agent performance via the target return in offline reinforcement learning, this paper proposes the Return-Aligned Decision Transformer (RADT). In the original Decision Transformer (DT), self-attention allocates little attention to the return tokens, so generated actions are hardly influenced by the target return. RADT's core idea is to extract features by attending solely to the return tokens, so that action generation consistently depends on the target return. Evaluated on standard offline RL benchmarks, RADT significantly reduces the discrepancy between the target and the actual return, outperforming DT and DT-based variants.

📝 Abstract
Traditional approaches in offline reinforcement learning aim to learn the optimal policy that maximizes the cumulative reward, also known as return. It is increasingly important to adjust the performance of AI agents to meet human requirements, for example, in applications like video games and education tools. Decision Transformer (DT) optimizes a policy that generates actions conditioned on the target return through supervised learning and includes a mechanism to control the agent's performance using the target return. However, the action generation is hardly influenced by the target return because DT's self-attention allocates scarce attention scores to the return tokens. In this paper, we propose Return-Aligned Decision Transformer (RADT), designed to more effectively align the actual return with the target return. RADT leverages features extracted by paying attention solely to the return, enabling action generation to consistently depend on the target return. Extensive experiments show that RADT significantly reduces the discrepancies between the actual return and the target return compared to DT-based methods.
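The mechanism the abstract describes, extracting features by paying attention solely to the return tokens, can be illustrated with a masked attention over a toy token sequence. The sketch below is a minimal NumPy illustration, not the paper's actual architecture: the (return, state, action) token layout, dimensions, and the `masked_attention` helper are assumptions for exposition. Restricting each query's attention mask to return positions makes the extracted feature at every step depend only on the return tokens, which is the alignment idea RADT builds on.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, mask):
    # Scaled dot-product attention; mask[i, j] = True where query i may attend to key j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # block masked positions
    return softmax(scores, axis=-1) @ v

# Toy trajectory: tokens ordered (return, state, action) per timestep.
rng = np.random.default_rng(0)
d, T = 8, 3
tokens = rng.normal(size=(3 * T, d))
is_return = np.array([t % 3 == 0 for t in range(3 * T)])

# Causal mask intersected with "return tokens only": each position can
# attend to past return tokens and nothing else, so the resulting
# features are a function of the target returns alone.
causal = np.tril(np.ones((3 * T, 3 * T), dtype=bool))
mask = causal & is_return[None, :]
feats = masked_attention(tokens, tokens, tokens, mask)
```

Because state and action tokens are masked out as keys/values, perturbing them leaves every other position's feature unchanged, whereas in plain DT self-attention they would dilute the return's influence.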
Problem

Research questions and friction points this paper is trying to address.

Offline Learning
Behavior Optimization
Score Consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Return-Aligned Decision Transformer
Target Return Precision
Offline Learning Enhancement