Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network

📅 2025-02-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low sample efficiency and weak modeling of dependencies among action dimensions when leveraging non-expert demonstrations and suboptimal online data in continuous control, this paper proposes Auto-Regressive Soft Q-learning (ARSQ). Methodologically, it introduces a coarse-to-fine hierarchical discretization of the action space and explicitly models conditional dependencies among action dimensions via an auto-regressive architecture, jointly optimizing the soft Q-learning objective and per-dimension advantage prediction. A hybrid offline-online training framework is adopted to improve generalization over heterogeneous suboptimal data. On D4RL, which includes non-expert datasets, ARSQ achieves a 1.62× average performance gain; on RLBench, which provides expert demonstrations, it significantly surpasses state-of-the-art methods. The core contribution lies in the first integration of auto-regressive modeling into soft Q-learning to capture coupling across action dimensions, thereby relaxing the conventional independent-dimension assumption and enabling efficient, robust use of diverse suboptimal data.
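The auto-regressive selection described above can be illustrated with a minimal sketch: each action dimension's discrete bin is chosen greedily, conditioned on the bins already fixed for earlier dimensions. The `heads` here are hypothetical stand-ins for the paper's learned per-dimension advantage predictors, not its actual architecture.

```python
import numpy as np

def autoregressive_select(state, heads):
    """Greedy auto-regressive action selection: the bin for dimension d is
    chosen conditioned on the bins already picked for dimensions 0..d-1."""
    chosen = []
    for head in heads:
        # head maps (state, previously chosen bins) -> advantage per bin
        adv = head(state, tuple(chosen))
        chosen.append(int(np.argmax(adv)))
    return chosen

# Hypothetical stand-ins for learned advantage heads: each returns a fixed
# random score per bin, nudged by how many dimensions are already decided.
rng = np.random.default_rng(0)
num_dims, num_bins = 3, 5
tables = [rng.standard_normal(num_bins) for _ in range(num_dims)]
heads = [lambda s, prev, t=t: t + 0.01 * len(prev) for t in tables]

action_bins = autoregressive_select(state=None, heads=heads)
```

In a trained model each head would condition on the state and the earlier choices through the network itself; the conditioning is what lets the Q-function capture coupling across dimensions instead of scoring each one independently.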

📝 Abstract
Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to "kick-start" training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstrations and online-collected data gathered during training. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average $1.62\times$ performance improvement over the SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data.
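The coarse-to-fine decomposition of the abstract can be sketched as hierarchical binning: a scalar action is first assigned to a coarse bin, then refined within that bin at each subsequent level. This is an illustrative sketch of the discretization idea only; the bin counts, levels, and function names are assumptions, not the paper's implementation.

```python
def coarse_to_fine_discretize(action, bins_per_level=3, levels=2, low=-1.0, high=1.0):
    """Encode a scalar action in [low, high] as a sequence of bin indices,
    one per hierarchy level, from coarsest to finest."""
    indices = []
    lo, hi = low, high
    for _ in range(levels):
        width = (hi - lo) / bins_per_level
        idx = min(int((action - lo) / width), bins_per_level - 1)
        indices.append(idx)
        lo = lo + idx * width   # zoom into the chosen bin
        hi = lo + width
    return indices

def decode(indices, bins_per_level=3, low=-1.0, high=1.0):
    """Reconstruct the bin-center approximation of the encoded action."""
    lo, hi = low, high
    for idx in indices:
        width = (hi - lo) / bins_per_level
        lo = lo + idx * width
        hi = lo + width
    return 0.5 * (lo + hi)

idxs = coarse_to_fine_discretize(0.5)   # e.g. coarse bin, then refinement
approx = decode(idxs)
```

With each extra level the reconstruction error shrinks by a factor of `bins_per_level`, which is why a shallow hierarchy of small discrete spaces can still support fine-grained control.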
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Sample Efficiency
Continuous Control Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-Regressive Soft Q-learning
continuous control tasks
coarse-to-fine modeling
Jijia Liu
Tsinghua University
Feng Gao
Tsinghua University, Beijing, China
Qingmin Liao
Tsinghua University, Beijing, China
Chao Yu
Tsinghua University, Beijing, China
Yu Wang
Tsinghua University, Beijing, China