AI Summary
Autonomous manipulation of articulated objects faces dual challenges: poor visual generalization and reliance on tactile initialization. To address these, we propose a vision-guided, tactile-refined collaborative framework that operates without prior kinematic models: vision provides global pose estimation and initial grasping configurations, while tactile feedback enables local closed-loop control. We incorporate surface normals as a geometric prior to constrain motion directionality and employ the von Mises-Fisher distribution to probabilistically model joint axis orientation, thereby enhancing cross-category generalization. Evaluated in more than 50,000 trials in simulation and real-world settings, our method significantly outperforms baselines (p < 0.0001), demonstrating strong robustness, scalability, and zero-shot transfer capability across diverse articulated objects.
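To make the direction modeling concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how a coarse motion direction on the unit sphere can be represented with a von Mises-Fisher distribution whose mean is seeded by an estimated surface normal; the normal, concentration value, and all function names are illustrative assumptions.

```python
# Hypothetical sketch: vMF modeling of a coarse joint-axis / motion direction,
# with an estimated surface normal as the mean-direction prior. NumPy only.
import numpy as np

def vmf_log_pdf(x, mu, kappa):
    """Log-density of the 3D von Mises-Fisher distribution on the unit sphere."""
    # Normalizer for d = 3: C_3(kappa) = kappa / (4*pi*sinh(kappa)).
    log_c = np.log(kappa) - np.log(4.0 * np.pi * np.sinh(kappa))
    return log_c + kappa * np.dot(mu, x)

def sample_vmf(mu, kappa, rng):
    """Draw one unit vector from vMF(mu, kappa) on S^2 via inverse-CDF sampling."""
    u = rng.uniform()
    # Cosine between the sample and the mean direction.
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa
    # Uniform direction in the plane orthogonal to mu.
    v = rng.normal(size=3)
    v -= np.dot(v, mu) * mu
    v /= np.linalg.norm(v)
    return w * mu + np.sqrt(max(1.0 - w**2, 0.0)) * v

rng = np.random.default_rng(0)
surface_normal = np.array([0.0, 0.0, 1.0])  # assumed output of the vision module
kappa = 20.0                                # concentration: confidence in the prior
coarse_direction = sample_vmf(surface_normal, kappa, rng)
print(coarse_direction, vmf_log_pdf(coarse_direction, surface_normal, kappa))
```

A higher concentration kappa expresses stronger trust in the surface-normal prior; a lower value spreads probability mass over the sphere, which is one way to encode uncertainty in the coarse visual estimate.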
Abstract
Autonomous manipulation of articulated objects remains a fundamental challenge for robots in human environments. Vision-based methods can infer hidden kinematics but often yield imprecise estimates on unfamiliar objects. Tactile approaches achieve robust control through contact feedback but require accurate initialization. This suggests a natural synergy: vision for global guidance, touch for local precision. Yet no framework systematically exploits this complementarity for generalized articulated manipulation. Here we present Vi-TacMan, which uses vision to propose grasps and coarse directions that seed a tactile controller for precise execution. By incorporating surface normals as geometric priors and modeling directions via von Mises-Fisher distributions, our approach achieves significant gains over baselines (all p < 0.0001). Critically, manipulation succeeds without explicit kinematic models: the tactile controller refines coarse visual estimates through real-time contact regulation. Tests on more than 50,000 simulated and diverse real-world objects confirm robust cross-category generalization. This work establishes that coarse visual cues suffice for reliable manipulation when coupled with tactile feedback, offering a scalable paradigm for autonomous systems in unstructured environments.
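The vision-seeds-tactile pipeline described above can be pictured as the control loop below. This is a hedged, hypothetical sketch under assumed interfaces (`vision.propose`, `tactile.read_contact_displacement`, `robot.move_ee`, and the other methods are placeholders, not the Vi-TacMan API): vision is queried once for a grasp and a coarse direction, and tactile feedback then corrects the motion direction at every step.

```python
# Hypothetical control-loop sketch (not the released Vi-TacMan implementation).
import numpy as np

def tactile_refined_manipulation(robot, vision, tactile,
                                 steps=200, step_size=0.005, gain=0.5):
    """Seed with a visual proposal, then refine the motion from contact feedback."""
    grasp_pose, direction = vision.propose()           # coarse, model-free visual estimate
    direction = direction / np.linalg.norm(direction)
    robot.grasp(grasp_pose)
    for _ in range(steps):
        # Tactile residual: how the contact patch has shifted since grasping (assumed 3-vector).
        contact_error = tactile.read_contact_displacement()
        # Nudge the coarse direction so the contact stays stable (no slip or jamming).
        direction = direction - gain * contact_error
        direction = direction / np.linalg.norm(direction)
        robot.move_ee(step_size * direction)
        if tactile.contact_lost() or robot.joint_limit_reached():
            break
```

The point of the sketch is the division of labor: the visual estimate only needs to be roughly right, because the per-step contact regulation absorbs the residual error without ever requiring an explicit kinematic model of the object.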