AI Summary
This work addresses offline goal-conditioned reinforcement learning in partially observable, history-dependent (non-Markovian) environments, where sparse rewards provide little discriminative signal and demonstration trajectories are difficult to stitch together. The authors propose the QHyer framework, which replaces the conventional return signal with a state-conditioned goal-reaching Q-estimator and uses a flow-based parameterization to improve behavioral stitching across trajectories. QHyer also introduces a gated hybrid attention-Mamba backbone with a content-adaptive history compression mechanism, which models both local dynamics and long-range dependencies and avoids the limitations of fixed-window extraction. Experiments show that QHyer achieves state-of-the-art performance on both Markovian and non-Markovian datasets.
Abstract
Offline goal-conditioned RL (GCRL) learns goal-reaching policies from static datasets, but real-world datasets are often partially observable and history-dependent, exhibiting a mix of Markovian and non-Markovian dynamics that violates standard RL assumptions. History-aware sequence models such as Decision Transformer (DT) are a natural fit for modeling long-term dependencies, yet pure attention is inefficient and brittle when it must handle local Markovian structure and long-range context simultaneously. Although recent hybrid architectures (e.g., LSDT) introduce local extractors to improve local dependency modeling, their fixed-window extraction cannot adapt its effective memory to varying dependency lengths in temporally heterogeneous settings, and it often truncates long-range context rather than compressing its content adaptively. Moreover, sequential offline GCRL faces a key bottleneck: under sparse rewards, return-to-go (RTG) becomes non-discriminative across sub-trajectories, providing little guidance for stitching goal-reaching behaviors from diverse demonstrations. To address these issues, we propose QHyer, which replaces RTG with a flow-parameterized, state-conditioned goal-reaching Q-estimator to support stitching across demonstrations, and introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. Extensive experiments demonstrate that QHyer achieves state-of-the-art performance on both non-Markovian and Markovian datasets, validating its effectiveness across diverse scenarios.
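
The RTG bottleneck can be illustrated with a small sketch. It is not taken from the paper: the two trajectories, the reward convention (a reward of 1 only on the final step when the goal is reached) and the helper `return_to_go` are assumptions chosen for illustration.

```python
def return_to_go(rewards):
    """Undiscounted return-to-go: rtg[t] is the sum of rewards from step t to the end."""
    rtg = []
    running = 0.0
    # Accumulate from the last step backwards, then restore time order.
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


# Hypothetical 6-step trajectories under a sparse goal-reaching reward
# (assumption: reward is 1 only on the step that reaches the goal, else 0).
success = [0, 0, 0, 0, 0, 1]
failure = [0, 0, 0, 0, 0, 0]

rtg_success = return_to_go(success)
rtg_failure = return_to_go(failure)

# Within each trajectory, every step carries the same RTG value, so RTG
# cannot rank states or sub-trajectories by how close they are to the goal.
distinct_in_success = len(set(rtg_success))
distinct_in_failure = len(set(rtg_failure))
```

In this toy setting the RTG token is constant along each trajectory, so conditioning on it gives a sequence model no basis for preferring one sub-trajectory over another. A state-conditioned goal-reaching value estimate can, in principle, vary from state to state.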