🤖 AI Summary
This work addresses partial observability and perceptual inconsistency in large-scale dynamic scenes, which arise from missing viewpoints and sparse observations. The authors propose AW4RE, an awareness-centric generative world model that integrates 4D spatiotemporal information into active world modeling. AW4RE provides a sensor-native surrogate environment for exploring sensing queries by combining 4D-informed evidence retrieval, action-conditioned geometric support with temporal coherence, and conditional generative completion. Under extreme viewpoint shifts, temporal gaps, and sparse geometric support, AW4RE produces more grounded and temporally consistent observation predictions than existing geometry-aware generative baselines.
📝 Abstract
Physical awareness, especially in a large and dynamic environment, is shaped by sensing decisions that determine observability across space, time, and scale, while the resulting observations in turn affect the quality of subsequent sensing decisions. This circular information structure makes physical awareness a fundamentally challenging decision problem under partial observability. While the past decade has seen unprecedented success of reinforcement learning (RL) in problems with full observability, decision problems with partial observations, such as POMDPs, remain largely open: real-world exploration is excessively costly, while sim-to-real pipelines suffer from unobserved viewpoints. We introduce AW4RE (Active World-model with 4D-informed Retrieval for Exploration), an awareness-centric generative world model that provides a sensor-native surrogate environment for exploring sensing queries. Conditioned on a queried sensing action, AW4RE estimates the action-conditioned observation process by combining 4D-informed evidence retrieval, action-conditioned geometric support with temporal coherence, and conditional generative completion. Experiments demonstrate that AW4RE produces more grounded and consistent predictions than geometry-aware generative baselines under extreme viewpoint shifts, temporal gaps, and sparse geometric support.
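As a reading aid only, the three components named in the abstract can be written as a chain of conditionals. This is a notational sketch, not the paper's formulation: every symbol below is a hypothetical placeholder. Here $a$ is the queried sensing action, $\mathcal{M}$ the store of past spatiotemporal (4D) observations, $E$ the retrieved evidence, $G$ the action-conditioned geometric support, and $p_\theta$ a conditional generative model.

```latex
\begin{align}
  E &= \mathrm{Retrieve}(\mathcal{M}, a)
    && \text{(4D-informed evidence retrieval; hypothetical operator)} \\
  G &= \mathrm{Support}(E, a)
    && \text{(action-conditioned geometric support; hypothetical operator)} \\
  \hat{o} &\sim p_\theta(o \mid a, E, G)
    && \text{(conditional generative completion)}
\end{align}
```

Under this reading, the predicted observation $\hat{o}$ stands in for what the sensor would return if action $a$ were executed, with the generative model presumably filling in what retrieval and geometry leave uncovered. How the actual system factorizes these steps, and where temporal coherence enters, is not specified in the abstract.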