Cross-Modal Visuo-Tactile Object Perception

📅 2026-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of accurately estimating physical object properties—such as geometry, stiffness, and contact dynamics—in robotic systems operating under non-rigid object interactions and complex frictional conditions. To this end, the authors propose the Cross-Modal Latent Filter (CMLF), which constructs a structured causal latent state space to dynamically integrate visual and tactile sensory streams through Bayesian temporal inference. Inspired by human multisensory perception, CMLF incorporates a bidirectional cross-modal prior propagation mechanism, enabling the first robotic replication of human-like perceptual coupling phenomena, including cross-modal illusions. Experimental results demonstrate that CMLF substantially enhances the robustness and efficiency of physical property estimation while exhibiting cross-sensory learning trajectories and perceptual behaviors closely mirroring those observed in humans.
📝 Abstract
Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical properties estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitutes a significant step toward generalizable, robust and physically consistent cross-modal integration for robotic multi-sensory perception.
Problem

Research questions and friction points this paper is trying to address.

cross-modal perception
visuo-tactile sensing
physical property estimation
uncertainty modeling
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal perception
visuo-tactile fusion
Bayesian inference
latent state-space model
active inference
🔎 Similar Papers
No similar papers found.