🤖 AI Summary
This work addresses the challenge of accurately estimating physical object properties—such as geometry, stiffness, and contact dynamics—in robotic systems interacting with non-rigid objects under complex frictional conditions. To this end, the authors propose the Cross-Modal Latent Filter (CMLF), which constructs a structured, causal latent state space that dynamically integrates visual and tactile sensory streams through Bayesian temporal inference. Inspired by human multisensory perception, CMLF incorporates a bidirectional cross-modal prior propagation mechanism, enabling robotic replication of human-like perceptual coupling phenomena, including cross-modal illusions. Experimental results demonstrate that CMLF substantially enhances the robustness and efficiency of physical property estimation while exhibiting cross-sensory learning trajectories and perceptual behaviors closely mirroring those observed in humans.
📝 Abstract
Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical property estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust, and physically consistent cross-modal integration for robotic multi-sensory perception.
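To make the filtering idea in the abstract concrete, below is a minimal sketch of recursive Bayesian estimation of a latent physical-property state fused from visual and tactile observations. It assumes a linear-Gaussian latent state and hypothetical linear observation models, whereas CMLF learns a structured, causal latent space with richer dynamics; all class names, dimensions, and observation matrices here are illustrative stand-ins, not the paper's implementation.

```python
# Hedged sketch: a linear-Gaussian stand-in for cross-modal latent filtering.
# The latent state z could represent properties such as stiffness or friction;
# vision and touch each provide a noisy, partial view of it. All quantities
# below are assumptions for illustration only.
import numpy as np


class CrossModalLatentFilterSketch:
    def __init__(self, dim_z=3, process_noise=1e-3):
        self.z = np.zeros(dim_z)                 # latent property estimate
        self.P = np.eye(dim_z)                   # belief covariance (uncertainty)
        self.Q = process_noise * np.eye(dim_z)   # latent dynamics noise

    def predict(self, A=None):
        """Propagate the belief through assumed latent dynamics z_t = A z_{t-1} + w."""
        A = np.eye(len(self.z)) if A is None else A
        self.z = A @ self.z
        self.P = A @ self.P @ A.T + self.Q

    def _update(self, y, H, R):
        """Standard Gaussian measurement update for one modality."""
        S = H @ self.P @ H.T + R                     # innovation covariance
        K = self.P @ H.T @ np.linalg.inv(S)          # gain
        self.z = self.z + K @ (y - H @ self.z)
        self.P = (np.eye(len(self.z)) - K @ H) @ self.P

    def step(self, y_vis, H_vis, R_vis, y_tac, H_tac, R_tac):
        """One filtering step: the posterior after the visual update serves as
        the prior for the tactile update, and the fused belief carries over to
        the next step, so each modality conditions the other's prior rather
        than being fused independently."""
        self.predict()
        self._update(y_vis, H_vis, R_vis)   # vision refines the shared latent prior
        self._update(y_tac, H_tac, R_tac)   # touch updates the vision-informed belief
        return self.z.copy(), self.P.copy()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = CrossModalLatentFilterSketch()
    H_vis = np.array([[1.0, 0.0, 0.0]])     # vision observes a geometry-like component
    H_tac = np.array([[0.0, 1.0, 0.5]])     # touch observes a stiffness/friction mix
    for _ in range(20):
        y_vis = np.array([0.8]) + 0.05 * rng.standard_normal(1)
        y_tac = np.array([1.2]) + 0.10 * rng.standard_normal(1)
        z, P = f.step(y_vis, H_vis, 0.05**2 * np.eye(1),
                      y_tac, H_tac, 0.10**2 * np.eye(1))
    print("latent estimate:", np.round(z, 3))
```

In this toy setting the sequential per-modality updates give a simple analogue of cross-modal prior transfer: a confident visual observation tightens the belief that the tactile update then refines, and vice versa over successive steps.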