🤖 AI Summary
Existing Mamba-based world models struggle to capture subtle state transitions, limiting reasoning quality and sample efficiency in environments with coupled local–global dynamics. To address this, we propose a change-aware world model designed for the Mamba architecture. Our approach introduces a global–local dual-perspective variation-aware coordination mechanism, implemented via GMamba and LMamba dual-branch modules that explicitly model state changes rather than raw sequences alone. We further integrate temporal variation prediction with end-to-end model-based reinforcement learning (MBRL) in a unified training framework. Evaluated on the Atari 100k benchmark, our method achieves a higher normalized human score than prior approaches, demonstrating that change-aware modeling improves both sample efficiency and policy performance.
📝 Abstract
Mimicking real interaction trajectories during world-model inference has been shown to improve the sample efficiency of model-based reinforcement learning (MBRL) algorithms. Many methods reason directly over known state sequences. However, this approach fails to capture the subtle variation between states that could improve reasoning quality. Much like how humans infer trends in event development from such variation, in this work we introduce the Global-Local variation Awareness Mamba-based world model (GLAM), which improves reasoning quality by perceiving and predicting variation between states. GLAM comprises two Mamba-based parallel reasoning modules, GMamba and LMamba, which focus on perceiving variation from global and local perspectives, respectively, during the reasoning process. GMamba identifies patterns of variation between states in the input sequence and leverages these patterns to enhance the prediction of future state variation. LMamba emphasizes reasoning about unknown information, such as rewards, termination signals, and visual representations, by perceiving variation in adjacent states. By integrating the strengths of the two modules, GLAM accounts for higher-value variation in environmental changes, providing the agent with more efficient imagination-based training. We demonstrate that our method outperforms existing methods in normalized human scores on the Atari 100k benchmark.
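To make the dual-branch idea concrete, here is a minimal, hypothetical sketch of change-aware processing: compute the variation (delta) between adjacent states, let a global branch aggregate variation patterns over the whole sequence, and let a local branch fuse each state with its most recent variation. The function names, shapes, and the simple aggregations standing in for the GMamba/LMamba state-space scans are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def state_deltas(states):
    """Variation between adjacent states: delta_t = s_t - s_{t-1}."""
    return states[1:] - states[:-1]

def global_branch(deltas, w_g):
    # Stand-in for GMamba: summarize variation patterns across the whole
    # sequence (a running mean here, in place of a Mamba SSM scan).
    context = np.cumsum(deltas, axis=0) / np.arange(1, len(deltas) + 1)[:, None]
    return context @ w_g

def local_branch(states, deltas, w_l):
    # Stand-in for LMamba: fuse each state with its adjacent-state variation
    # to support predicting rewards, terminations, and representations.
    return np.concatenate([states[1:], deltas], axis=-1) @ w_l

T, D, H = 8, 4, 6                     # sequence length, state dim, hidden dim
states = rng.normal(size=(T, D))
deltas = state_deltas(states)         # shape (T-1, D)

w_g = rng.normal(size=(D, H))         # hypothetical projection weights
w_l = rng.normal(size=(2 * D, H))

# Combine the two perspectives into one change-aware feature per step.
fused = global_branch(deltas, w_g) + local_branch(states, deltas, w_l)
print(fused.shape)                    # → (7, 6)
```

The key design point illustrated here is that both branches consume explicit state *differences* rather than raw states alone, which is what distinguishes change-aware reasoning from standard sequence modeling.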