🤖 AI Summary
This work investigates whether Decision Transformer (DT) universally outperforms conventional MLP-based methods in offline reinforcement learning—particularly under sparse-reward settings. We conduct systematic evaluations on the Robomimic and D4RL benchmarks, comparing DT against supervised behavior cloning variants—including filtered behavior cloning (FBC)—and value-based approaches such as conservative Q-learning (CQL). Our results show that, on sparse-reward tasks, lightweight FBC matches or exceeds DT's performance while requiring significantly less training data and compute. These findings challenge the prevailing assumption that DT is inherently well-suited for offline RL, revealing that sequence modeling does not necessarily confer advantages and may instead introduce redundant computation and optimization difficulties. The core contribution is an empirical demonstration that carefully designed supervised behavior cloning strategies can effectively replace transformer architectures in typical sparse-reward offline RL scenarios—providing critical evidence and a parsimony-oriented guideline for model selection.
📝 Abstract
In recent years, extensive work has explored the application of the Transformer architecture to reinforcement learning problems. Among these, Decision Transformer (DT) has gained particular attention in the context of offline reinforcement learning due to its ability to frame return-conditioned policy learning as a sequence modeling task. Most recently, Bhargava et al. (2024) provided a systematic comparison of DT with more conventional MLP-based offline RL algorithms, including Behavior Cloning (BC) and Conservative Q-Learning (CQL), and claimed that DT exhibits superior performance in sparse-reward and low-quality data settings.
In this paper, through experiments on robotic manipulation tasks (Robomimic) and locomotion benchmarks (D4RL), we show that MLP-based Filtered Behavior Cloning (FBC) achieves competitive or superior performance compared to DT in sparse-reward environments. FBC simply filters low-performing trajectories out of the dataset and then performs ordinary behavior cloning on what remains. Not only is FBC very straightforward, it also requires less training data and is computationally more efficient. The results therefore suggest that DT is not preferable for sparse-reward environments. Prior work suggests that DT is arguably not preferable for dense-reward environments either. Thus, we pose the question: Is DT ever preferable?
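To make the FBC recipe concrete, here is a minimal sketch of the two-stage procedure described above: keep only the highest-return trajectories, then run ordinary behavior cloning on the remaining (state, action) pairs. The trajectory format, the percentile-based filter, and the linear least-squares policy are illustrative assumptions, not the paper's actual implementation (which uses an MLP policy).

```python
import numpy as np

def filter_trajectories(trajectories, keep_fraction=0.5):
    """Stage 1: keep the top `keep_fraction` of trajectories by total return.

    In a sparse-reward setting this amounts to keeping successful episodes.
    """
    returns = np.array([sum(t["rewards"]) for t in trajectories])
    threshold = np.quantile(returns, 1.0 - keep_fraction)
    return [t for t, r in zip(trajectories, returns) if r >= threshold]

def behavior_cloning(trajectories):
    """Stage 2: ordinary BC on the filtered data.

    For illustration, fit a linear policy a = s @ W by least squares;
    in practice this would be an MLP trained with a regression loss.
    """
    states = np.concatenate([t["states"] for t in trajectories])
    actions = np.concatenate([t["actions"] for t in trajectories])
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W

# Toy sparse-reward dataset: reward is 1 only at the end of a success.
rng = np.random.default_rng(0)
true_w = np.array([[1.0], [0.0], [0.0]])  # hypothetical expert policy
trajs = []
for success in [1, 0, 1, 0]:
    s = rng.normal(size=(10, 3))
    trajs.append({
        "states": s,
        "actions": s @ true_w,
        "rewards": [0.0] * 9 + [float(success)],
    })

kept = filter_trajectories(trajs, keep_fraction=0.5)  # the 2 successes
W = behavior_cloning(kept)
```

The appeal over DT is visible even in this sketch: there is no return conditioning, no context window, and no sequence model, only a filter and a standard supervised fit.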