Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge posed by delayed feedback in real-world systems, which violates the Markov property and leads to state-space explosion and poor sample efficiency. The authors propose Delayed Homomorphic Reinforcement Learning (DHRL), a novel framework that, for the first time, integrates MDP homomorphism theory into delayed-feedback settings. By constructing belief-equivalent augmented states and applying homomorphic compression, DHRL substantially reduces the size of the abstract MDP while preserving policy optimality. The method jointly optimizes policy and value networks and provides theoretical guarantees on state compression ratios and sample complexity. Empirical evaluations on MuJoCo continuous control tasks demonstrate that DHRL significantly outperforms existing baselines, with particularly pronounced advantages under long delay conditions.
📝 Abstract
Reinforcement learning in real-world systems is often accompanied by delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical state augmentation approaches cause state-space explosion, which introduces a severe sample-complexity burden. Despite recent progress, the state-of-the-art augmentation-based baselines remain incomplete: they either predominantly reduce the burden on the critic or adopt non-unified treatments for the actor and critic. To provide a structured and sample-efficient solution, we propose delayed homomorphic reinforcement learning (DHRL), a framework grounded in MDP homomorphisms that collapses belief-equivalent augmented states and enables efficient policy learning on the resulting abstract MDP without loss of optimality. We provide theoretical analyses of state-space compression bounds and sample complexity, and introduce a practical algorithm. Experiments on continuous control tasks from the MuJoCo benchmark confirm that our algorithm outperforms strong augmentation-based baselines, particularly under long delays.
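The canonical state augmentation the abstract refers to can be sketched as follows: when observations arrive with a delay of d steps, the agent's input is made Markov again by pairing the last revealed state with the d actions taken since it was generated. This is a minimal illustrative wrapper (not the paper's DHRL algorithm; `CounterEnv` and all names are hypothetical), shown to make the resulting state-space blow-up concrete:

```python
from collections import deque

class CounterEnv:
    """Toy deterministic environment: state is an integer, action adds to it."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state += action
        return self.state, float(self.state), False

class DelayedObservationWrapper:
    """Augment a delayed-feedback environment so the agent's input is Markov:
    the augmented state is the state observed d steps ago plus the d actions
    taken since then. Illustrative sketch of the augmentation baseline only."""
    def __init__(self, env, delay):
        self.env = env
        self.delay = delay
        self.obs_buffer = deque()     # states generated but not yet revealed
        self.action_buffer = deque()  # actions taken since the revealed state

    def reset(self):
        s0 = self.env.reset()
        # Pad with the initial state until real feedback starts arriving.
        self.obs_buffer = deque([s0] * self.delay)
        self.action_buffer = deque()
        return (s0, tuple(self.action_buffer))

    def step(self, action):
        s_next, reward, done = self.env.step(action)
        self.obs_buffer.append(s_next)
        self.action_buffer.append(action)
        revealed = self.obs_buffer.popleft()  # observed with delay d
        if len(self.action_buffer) > self.delay:
            self.action_buffer.popleft()
        return (revealed, tuple(self.action_buffer)), reward, done

# With delay d, the augmented state carries d extra action components, so the
# augmented space grows exponentially in d -- the explosion DHRL compresses
# by collapsing belief-equivalent augmented states.
env = DelayedObservationWrapper(CounterEnv(), delay=2)
aug = env.reset()            # (0, ())
aug, _, _ = env.step(1)      # (0, (1,))
aug, _, _ = env.step(2)      # (0, (1, 2))
aug, _, _ = env.step(3)      # (1, (2, 3)): state from 2 steps ago + 2 actions
```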
Problem

Research questions and friction points this paper is trying to address.

delayed feedback
reinforcement learning
state-space explosion
sample complexity
Markov assumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

delayed reinforcement learning
MDP homomorphism
state augmentation
sample efficiency
belief-state compression
Jongsoo Lee
Department of Convergence IT Engineering, POSTECH, Pohang, 37673, South Korea
Jangwon Kim
Department of Convergence IT Engineering, POSTECH, Pohang, 37673, South Korea
Soohee Han
Professor of Electrical Engineering and Convergence IT Engineering, POSTECH
Reinforcement learning · Mathematical Instrumentation · Battery informatics