🤖 AI Summary
This work addresses two core challenges in robotic interaction with heterogeneous, cross-category objects, “where to act” and “how to act”, by proposing HeteroGenManip, a task-conditioned two-stage framework that decouples the initial grasp from interaction execution. In the first stage, a Foundation-Correspondence-Guided Grasp module uses structural priors to align the initial contact state, reducing grasp pose uncertainty. In the second stage, the Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models and fuses geometric and part-level features through a dual-stream cross-attention mechanism to carry out fine-grained interaction. The method achieves an average performance gain of 31% in simulation and 36.7% across four real-world tasks, and shows robust intra-category generalization over object shapes and poses.
📝 Abstract
Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To accomplish such tasks reliably, robots must address two fundamental challenges: “where to manipulate” (contact point localization) and “how to manipulate” (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework that decouples the initial grasp from complex interaction execution. First, the Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, significantly reducing grasp pose uncertainty. Subsequently, the Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models and integrates fine-grained geometric information with highly variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31% performance improvement on simulation tasks spanning a broad range of object types, alongside a 36.7% gain across four real-world tasks with different interaction types.
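The abstract does not spell out how the routing and dual-stream cross-attention are implemented, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes that each object category has its own encoder producing "geometric" and "part" tokens, that one stream lets geometric tokens attend to part tokens while the other does the reverse, and that the two pooled streams are concatenated into a conditioning vector for the policy. The category names, the toy encoders, and the functions `cross_attention`, `mean_pool`, and `fuse` are all hypothetical, and learned projections are omitted for brevity.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    # Scaled dot-product cross-attention; keys and values share the same tokens.
    # Learned projection matrices are omitted (assumption, for brevity).
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(dim)])
    return out

def mean_pool(tokens):
    # Average a list of equal-length token vectors into one vector.
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]

# Hypothetical stand-ins for category-specialized foundation models.
def encode_rigid(points):
    geo = [list(p) for p in points]   # toy geometric tokens: raw points
    part = [mean_pool(points)]        # toy part token: the centroid
    return geo, part

def encode_deformable(points):
    geo = [list(p) for p in points]
    part = [list(points[0]), list(points[-1])]  # toy part tokens: two endpoints
    return geo, part

ENCODERS = {"rigid": encode_rigid, "deformable": encode_deformable}

def fuse(category, points):
    # Route the object to its category encoder, then run both attention streams.
    geo, part = ENCODERS[category](points)
    geo_stream = cross_attention(geo, part)    # geometry queries part features
    part_stream = cross_attention(part, geo)   # parts query geometry
    # Concatenate pooled streams into one conditioning vector.
    return mean_pool(geo_stream) + mean_pool(part_stream)

points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
condition = fuse("rigid", points)
```

In a full system, a vector like `condition` would presumably condition the denoising network of the diffusion policy; that step is outside the scope of this sketch.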