OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses contact-rich robotic manipulation, where vision alone struggles to perceive contact forces, friction, and state transitions accurately, and where existing tactile approaches neither model contact dynamics explicitly nor support high-frequency closed-loop control. To overcome these limitations, the authors propose OmniVTA, a framework integrating a self-supervised tactile encoder, a dual-stream visuo-tactile world model, a contact-aware fusion strategy, and a 60 Hz reflexive controller, accompanied by OmniViTac, a large-scale multimodal dataset. The approach uses tactile signals to explicitly model contact evolution and to drive 60 Hz closed-loop control, moving beyond the traditional treatment of touch as a passive observation. Experiments across six contact-intensive tasks show significant gains over state-of-the-art methods and strong generalization to unseen objects and geometric configurations, validating the value of predictive contact modeling and high-frequency tactile feedback.

📝 Abstract
Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than using them to explicitly model contact dynamics or enable closed-loop control. In this paper, we present OmniViTac, a large-scale visuo-tactile-action dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose OmniVTA, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model for predicting short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60 Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high-frequency tactile feedback for contact-rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.
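The reflexive controller described above, which corrects deviations between predicted and observed tactile signals at 60 Hz, can be illustrated with a minimal sketch. The paper's actual controller is not specified in this summary; the proportional-correction rule, function names, and scalar tactile state below are all assumptions chosen for illustration.

```python
def reflexive_correction(predicted, observed, gain=0.5):
    """Proportional correction from tactile prediction error (illustrative).

    Returns an action delta that pushes the end-effector so the observed
    tactile signal moves toward the world model's prediction.
    """
    return gain * (predicted - observed)

def run_reflex_loop(predict, read_tactile, apply_delta, hz=60, seconds=1.0):
    """A fixed-rate closed loop; real-time sleeping is omitted for clarity."""
    for _ in range(int(hz * seconds)):
        apply_delta(reflexive_correction(predict(), read_tactile()))

# Toy demo: a scalar "contact force" nudged toward a predicted value.
state = {"force": 0.0}
run_reflex_loop(
    predict=lambda: 1.0,            # stand-in for the world model's prediction
    read_tactile=lambda: state["force"],
    apply_delta=lambda d: state.update(force=state["force"] + d),
)
print(round(state["force"], 6))  # -> 1.0
```

With a gain of 0.5, the prediction error halves every cycle, so one second at 60 Hz is more than enough for the toy state to converge; a real controller would instead emit joint or end-effector deltas from multi-dimensional tactile features.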
Problem

Research questions and friction points this paper is trying to address.

contact-rich manipulation
visuo-tactile perception
tactile feedback
world modeling
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

visuo-tactile world modeling
contact-rich manipulation
tactile feedback
closed-loop control
large-scale dataset