Self-supervised Hierarchical Visual Reasoning with World Model

📅 2026-05-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of building effective visual reasoning representations for reinforcement learning in 3D open-world environments, where vast state spaces and compounding multi-step prediction errors hinder performance. The authors propose ResDreamer, a hierarchical world model that achieves progressive abstraction through residual reconstruction from higher to lower layers. Relying solely on self-supervised learning, ResDreamer extracts task-relevant visual representations and modulates low-level predictions via high-level residuals. By eschewing high-fidelity visual reconstruction in favor of task-critical signals, the method enables scalable visual foresight while maintaining linear cross-layer communication costs. Experiments demonstrate that ResDreamer achieves state-of-the-art sample and parameter efficiency, significantly enhancing agents' online reasoning capabilities in dynamic, open-world settings.
📝 Abstract
3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. Existing self-supervised visual foresight reasoning approaches often suffer from multi-step error accumulation, so many recent studies resort to injecting domain-specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task-relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher-level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the "Bitter Lesson", ResDreamer trains its reasoning representations in a purely self-supervised manner. The higher-level residual representations are used to modulate lower-level predictions, allowing the world model to scale effectively with only linearly increasing cross-layer communication costs. Experiments show that ResDreamer achieves state-of-the-art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open-ended, dynamic environments. The code is accessible at https://github.com/XuYuanFei01/ResDreamer.
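The residual idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the least-squares predictors, the `features` helper, the synthetic dynamics, and the choice to apply the higher level as an additive correction are all assumptions made for illustration. Level 0 predicts the next observation, level 1 is fitted to whatever level 0 failed to explain, and each level talks only to the one directly below it, so the number of cross-layer links grows linearly with depth.

```python
import numpy as np


def features(x, level):
    """Hypothetical per-level features: a bias column plus element-wise
    powers of the input up to degree level + 1."""
    cols = [np.ones((x.shape[0], 1))] + [x ** p for p in range(1, level + 2)]
    return np.hstack(cols)


def fit_residual_stack(x_now, x_next, num_levels):
    """Fit one least-squares predictor per level; each level above the first
    is trained on the residual left by the levels below it."""
    weights = []
    target = x_next.copy()  # level 0 reconstructs the next observation itself
    for level in range(num_levels):
        phi = features(x_now, level)
        w, *_ = np.linalg.lstsq(phi, target, rcond=None)
        weights.append(w)
        target = target - phi @ w  # residual handed to the next level up
    return weights


def predict(x_now, weights, use_levels=None):
    """Sum the outputs of the first `use_levels` levels (additive correction
    is an assumption, standing in for the paper's modulation step)."""
    use_levels = len(weights) if use_levels is None else use_levels
    pred = np.zeros((x_now.shape[0], weights[0].shape[1]))
    for level in range(use_levels):
        pred += features(x_now, level) @ weights[level]
    return pred


# Synthetic dynamics with a nonlinear term that a linear level 0 cannot capture.
rng = np.random.default_rng(0)
x_now = rng.normal(size=(256, 3))
x_next = 0.9 * x_now + 0.3 * x_now ** 2 + 0.01 * rng.normal(size=(256, 3))

weights = fit_residual_stack(x_now, x_next, num_levels=2)
err_level0 = float(np.mean((x_next - predict(x_now, weights, use_levels=1)) ** 2))
err_stack = float(np.mean((x_next - predict(x_now, weights)) ** 2))
cross_layer_links = len(weights) - 1  # one link per adjacent pair of levels

print(f"level 0 only: {err_level0:.5f}  full stack: {err_stack:.5f}")
```

On the training data the full stack can never do worse than level 0 alone, because a least-squares fit to the residual includes the zero correction as a candidate. A learned world model offers no such guarantee, which is one reason this should be read only as an intuition aid.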
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
open-world environments
visual reasoning
self-supervised learning
world model
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical world model
residual abstraction
self-supervised reasoning
visual foresight
sample efficiency