🤖 AI Summary
Vision-based locomotion for legged robots suffers from limited robustness due to occlusions, specular reflections, and illumination variations. To address this, we propose KiVi, a framework that decouples proprioceptive and visual pathways and then integrates them: proprioception serves as a stable backbone, while visuospatial reasoning is selectively fused via a memory-augmented attention mechanism, improving resilience to out-of-distribution noise and severe occlusions. KiVi unifies deep reinforcement learning, multimodal sensor fusion, and joint visuo-proprioceptive modeling. Experiments demonstrate that a quadrupedal robot achieves dynamic, stable walking across diverse, unstructured outdoor terrains. Compared to vision-only and standard sensor-fusion baselines, KiVi significantly improves real-world reliability and generalization under challenging perceptual conditions.
📝 Abstract
Vision-based locomotion has shown great promise in enabling legged robots to perceive and adapt to complex environments. However, visual information is inherently fragile: it is vulnerable to occlusions, reflections, and lighting changes that often destabilize locomotion. Inspired by animal sensorimotor integration, we propose KiVi, a Kinesthetic-Visuospatial integration framework, where kinesthetics encodes proprioceptive sensing of body motion and visuospatial reasoning captures visual perception of the surrounding terrain. Specifically, KiVi separates these pathways, leveraging proprioception as a stable backbone while selectively incorporating vision for terrain awareness and obstacle avoidance. This modality-balanced yet integrative design, combined with memory-enhanced attention, allows the robot to robustly interpret visual cues while retaining fallback stability through proprioception. Extensive experiments show that our method enables quadruped robots to stably traverse diverse terrains and operate reliably in unstructured outdoor environments, remaining robust to out-of-distribution (OOD) visual noise and occlusions unseen during training, thereby highlighting its effectiveness and applicability to real-world legged locomotion.
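Neither the summary nor the abstract specifies the exact network, but the core idea, a proprioceptive backbone that queries a memory of visual embeddings through attention, can be sketched concretely. The PyTorch sketch below is purely illustrative: the module name `KiViPolicySketch`, all dimensions (a 48-D proprioceptive state, a 187-D terrain observation, a 16-step visual memory), and the single-robot memory bookkeeping are assumptions for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn


class KiViPolicySketch(nn.Module):
    """Illustrative two-pathway policy: proprioception as the stable
    backbone, vision fused selectively via memory-augmented attention.
    All sizes and module choices are hypothetical, not from the paper.
    """

    def __init__(self, proprio_dim=48, vision_dim=187, embed_dim=128,
                 memory_len=16, action_dim=12):
        super().__init__()
        # Proprioceptive backbone: always active, never gated out.
        self.proprio_enc = nn.Sequential(
            nn.Linear(proprio_dim, embed_dim), nn.ELU(),
            nn.Linear(embed_dim, embed_dim), nn.ELU(),
        )
        # Visual pathway: encodes e.g. a depth or height-map observation.
        self.vision_enc = nn.Sequential(
            nn.Linear(vision_dim, embed_dim), nn.ELU(),
        )
        # Rolling memory of past visual embeddings (attention keys/values),
        # so the policy can lean on recent terrain context when the
        # current frame is occluded or noisy.
        self.register_buffer("memory", torch.zeros(memory_len, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4,
                                          batch_first=True)
        # Actor head consumes the two concatenated pathways.
        self.actor = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ELU(),
            nn.Linear(embed_dim, action_dim),
        )

    def forward(self, proprio, vision):
        # proprio: (B, proprio_dim), vision: (B, vision_dim); B = 1 here
        # to keep the rolling-memory bookkeeping simple.
        p = self.proprio_enc(proprio)                    # (B, E)
        v = self.vision_enc(vision)                      # (B, E)
        # FIFO update of the visual memory with the newest embedding.
        self.memory = torch.cat([self.memory[1:], v.detach()], dim=0)
        # The proprioceptive state queries the visual memory: attention
        # weights decide how much (and which) vision to trust.
        kv = self.memory.unsqueeze(0)                    # (1, T, E)
        fused, _ = self.attn(query=p.unsqueeze(1), key=kv, value=kv)
        # Concatenate the untouched proprioceptive backbone with the
        # attended visual context; if vision is uninformative, the actor
        # can still act from the proprioceptive half alone.
        return self.actor(torch.cat([p, fused.squeeze(1)], dim=-1))


# Usage: one control step for a single robot.
policy = KiViPolicySketch()
action = policy(torch.randn(1, 48), torch.randn(1, 187))
print(action.shape)  # torch.Size([1, 12])
```

The design point this sketch makes explicit is the fallback path: the proprioceptive embedding reaches the actor unmodified, so even when attention assigns the visual memory near-zero weight, the policy degrades to proprioception-only control rather than failing outright.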