🤖 AI Summary
This work addresses the challenge posed by delayed feedback in real-world systems, which violates the Markov property and leads to state-space explosion and poor sample efficiency. The authors propose Delayed Homomorphic Reinforcement Learning (DHRL), a novel framework that, for the first time, integrates MDP homomorphism theory into delayed-feedback settings. By constructing belief-equivalent augmented states and applying homomorphic compression, DHRL substantially reduces the size of the abstract MDP while preserving policy optimality. The method jointly optimizes the policy and value networks and provides theoretical guarantees on state-compression ratios and sample complexity. Empirical evaluations on MuJoCo continuous control tasks demonstrate that DHRL significantly outperforms existing baselines, with particularly pronounced advantages under long delays.
📝 Abstract
Reinforcement learning in real-world systems is often accompanied by delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical state-augmentation approaches cause state-space explosion, which introduces a severe sample-complexity burden. Despite recent progress, state-of-the-art augmentation-based baselines remain incomplete: they either predominantly reduce the burden on the critic or adopt non-unified treatments of the actor and critic. To provide a structured and sample-efficient solution, we propose delayed homomorphic reinforcement learning (DHRL), a framework grounded in MDP homomorphisms that collapses belief-equivalent augmented states and enables efficient policy learning on the resulting abstract MDP without loss of optimality. We provide theoretical analyses of state-space compression bounds and sample complexity, and introduce a practical algorithm. Experiments on continuous control tasks in the MuJoCo benchmark confirm that our algorithm outperforms strong augmentation-based baselines, particularly under long delays.
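As background, the canonical state augmentation that the abstract critiques can be sketched as follows. This is a minimal illustration of the standard construction for a delay-d MDP (the agent conditions on the last observed state plus the d actions taken since), not the paper's DHRL compression; all function and variable names are hypothetical:

```python
def make_augmented_state(last_obs, pending_actions, delay):
    """Canonical augmentation for a constant-delay MDP: pair the most
    recently observed state with the buffer of actions issued since that
    observation. (Hypothetical helper; illustrates the standard
    construction the abstract refers to, not DHRL itself.)"""
    assert len(pending_actions) == delay, "buffer must hold exactly d actions"
    return (last_obs, tuple(pending_actions))


def augmented_space_size(n_states, n_actions, delay):
    """Size of the augmented state space: |S| * |A|**d.
    The exponential growth in the delay d is the 'state-space explosion'
    and sample-complexity burden the abstract describes."""
    return n_states * n_actions ** delay
```

For example, a toy problem with 10 states and 4 actions under a delay of 3 already has `10 * 4**3 = 640` augmented states; the same problem with delay 10 has over 10 million, which is why collapsing belief-equivalent augmented states matters.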