🤖 AI Summary
To address low sample efficiency and inadequate modeling of action-dimensional dependencies when leveraging non-expert demonstrations and online suboptimal data in continuous control, this paper proposes the Autoregressive Soft Q-Network (AR-SQNet). Methodologically, it introduces a novel coarse-to-fine hierarchical action-space discretization and explicitly models conditional dependencies among action dimensions via an autoregressive architecture, jointly optimizing the Soft Q-learning objective and advantage-value sequence prediction. An offline-online hybrid training framework is adopted to enhance generalization over heterogeneous suboptimal data. Evaluated on D4RL—including non-expert datasets—AR-SQNet achieves a 1.62× average performance gain; on RLBench—with expert demonstrations—it significantly surpasses state-of-the-art methods. The core contribution lies in the first integration of autoregressive modeling into soft Q-learning to capture action coupling, thereby breaking the conventional independent-dimension assumption and enabling efficient, robust utilization of diverse suboptimal data.
📝 Abstract
Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to"kick-start"training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstration and online-collected data during the training process. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average $1.62 imes$ performance improvement over SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data.