🤖 AI Summary
This work addresses the challenge posed by delayed feedback in real-world systems, which violates the Markov property and leads to state-space explosion and poor sample efficiency. The authors propose Delayed Homomorphic Reinforcement Learning (DHRL), a novel framework that, for the first time, integrates MDP homomorphism theory into delayed-feedback settings. By constructing belief-equivalent augmented states and applying homomorphic compression, DHRL substantially reduces the size of the abstract MDP while preserving policy optimality. The method jointly optimizes the policy and value networks and provides theoretical guarantees on state-compression ratios and sample complexity. Empirical evaluations on MuJoCo continuous control tasks demonstrate that DHRL significantly outperforms existing baselines, with particularly pronounced advantages under long delays.
📝 Abstract
Reinforcement learning in real-world systems is often accompanied by delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical state-augmentation approaches cause state-space explosion, which introduces a severe sample-complexity burden. Despite recent progress, state-of-the-art augmentation-based baselines remain incomplete: they either predominantly reduce the burden on the critic or adopt non-unified treatments of the actor and critic. To provide a structured and sample-efficient solution, we propose delayed homomorphic reinforcement learning (DHRL), a framework grounded in MDP homomorphisms that collapses belief-equivalent augmented states and enables efficient policy learning on the resulting abstract MDP without loss of optimality. We provide theoretical analyses of state-space compression bounds and sample complexity, and introduce a practical algorithm. Experiments on continuous control tasks in the MuJoCo benchmark confirm that our algorithm outperforms strong augmentation-based baselines, particularly under long delays.
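As background, the canonical state augmentation that the abstract critiques can be sketched as follows. This is a minimal illustration of the standard construction for a delay-d MDP (the agent conditions on the last observed state plus the d actions taken since), not the paper's DHRL compression; all function and variable names are hypothetical:

```python
def make_augmented_state(last_obs, pending_actions, delay):
    """Canonical augmentation for a constant-delay MDP: pair the most
    recently observed state with the buffer of actions issued since that
    observation. (Hypothetical helper; illustrates the standard
    construction the abstract refers to, not DHRL itself.)"""
    assert len(pending_actions) == delay, "buffer must hold exactly d actions"
    return (last_obs, tuple(pending_actions))


def augmented_space_size(n_states, n_actions, delay):
    """Size of the augmented state space: |S| * |A|**d.
    The exponential growth in the delay d is the 'state-space explosion'
    and sample-complexity burden the abstract describes."""
    return n_states * n_actions ** delay
```

For example, a toy problem with 10 states and 4 actions under a delay of 3 already has `10 * 4**3 = 640` augmented states; the same problem with delay 10 has over 10 million, which is why collapsing belief-equivalent augmented states matters.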