🤖 AI Summary
This work addresses the challenge of efficient navigation in dynamic environments, which requires anticipating motion patterns beyond the current perceptual horizon. Conventional approaches rely on prolonged global observations to construct Maps of Dynamics (MoDs). In contrast, this study proposes a video- and pose-conditioned neural architecture that predicts future global MoDs using only short-term, local first-person video and pose data. Notably, this is the first method to achieve MoD prediction without access to global observations during inference, leveraging externally generated MoDs as privileged supervision during training. Experiments demonstrate that the model accurately infers global dynamic trends from limited local observations in large-scale simulated environments and transfers zero-shot to real-world robotic systems.
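To make the MoD representation concrete, below is a minimal sketch assuming an MoD stored as a grid of per-cell histograms over discretized motion directions. The grid size, direction binning, and the `record_motion` helper are all illustrative assumptions; the paper's actual MoD format may differ.

```python
import numpy as np

# Hypothetical Map of Dynamics (MoD): each grid cell holds a discrete
# distribution over N_DIRS motion directions, summarizing how agents
# tend to move through that part of the environment.
H, W, N_DIRS = 64, 64, 8

# Accumulate observed motion directions into per-cell histograms.
counts = np.zeros((H, W, N_DIRS))

def record_motion(row: int, col: int, heading_rad: float) -> None:
    """Bin an observed agent heading into the cell's direction histogram."""
    d = int(round(heading_rad / (2 * np.pi / N_DIRS))) % N_DIRS
    counts[row, col, d] += 1

# Example: an agent moving roughly east through cell (10, 20).
record_motion(10, 20, 0.05)

# Normalize counts into per-cell direction distributions (the MoD).
mod = counts / np.clip(counts.sum(axis=-1, keepdims=True), 1, None)
```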
📄 Abstract
Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space that is useful for long-term global planning, but constructing them traditionally requires observing the entire environment over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. As a result, EgoMoD can forecast future motion dynamics over the whole environment rather than merely extrapolate past patterns within the robot's field of view. Experiments in large simulated environments show that EgoMoD accurately predicts future MoDs under limited observability, while evaluation with real images demonstrates its zero-shot transferability to real systems.
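As a rough illustration of the kind of video- and pose-conditioned predictor the abstract describes, the following PyTorch sketch fuses features from a short egocentric clip with the corresponding robot poses, decodes them into a global per-cell direction grid, and trains against a privileged MoD target. Every concrete choice here, including the encoders, grid size, pose format, and cross-entropy loss, is an assumption for illustration and not EgoMoD's actual architecture.

```python
import torch
import torch.nn as nn

# Assumed shapes: the MoD is an H x W grid of distributions over N_DIRS
# motion directions; pose is position plus quaternion (7-D, assumed).
H, W, N_DIRS, POSE_DIM = 32, 32, 8, 7

class EgoMoDSketch(nn.Module):
    def __init__(self, feat: int = 128):
        super().__init__()
        # Per-frame image encoder; frame features are averaged over time.
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat),
        )
        self.pose_enc = nn.Sequential(nn.Linear(POSE_DIM, feat), nn.ReLU())
        # Decode fused video+pose features into per-cell direction logits.
        self.decoder = nn.Linear(2 * feat, H * W * N_DIRS)

    def forward(self, video: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, h, w) egocentric clip; pose: (B, T, POSE_DIM).
        B, T = video.shape[:2]
        f = self.frame_enc(video.flatten(0, 1)).view(B, T, -1).mean(1)
        p = self.pose_enc(pose).mean(1)
        logits = self.decoder(torch.cat([f, p], dim=-1))
        return logits.view(B, H, W, N_DIRS)

# Privileged supervision: the target MoD is computed from external global
# observations, available only during training, never at inference.
model = EgoMoDSketch()
video = torch.randn(2, 8, 3, 64, 64)        # short egocentric clip
pose = torch.randn(2, 8, POSE_DIM)          # robot pose for each frame
target_mod = torch.softmax(torch.randn(2, H, W, N_DIRS), dim=-1)

pred = model(video, pose).log_softmax(dim=-1)
loss = -(target_mod * pred).sum(-1).mean()  # per-cell cross-entropy
loss.backward()
```

The key point the sketch captures is the training asymmetry: the global MoD target appears only in the loss, so at deployment the model maps short local video and pose inputs directly to an environment-wide prediction.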