HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the two core challenges of robotic interaction with cross-category heterogeneous objects, “where to act” and “how to act”, by proposing a task-conditioned two-stage framework that decouples the initial grasp from interaction execution. In the first stage, a Foundation-Correspondence-Guided Grasp module uses structural priors to align the initial contact state, reducing grasp pose uncertainty. In the second stage, the Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models and fuses fine-grained geometric information with part-level features through a dual-stream cross-attention mechanism. The method reports an average performance improvement of 31% in simulation tasks and a 36.7% gain across four real-world tasks, along with robust intra-category generalization to object shapes and poses.
📝 Abstract
Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: "where to manipulate" (contact point localization) and "how to manipulate" (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple the initial grasp from complex interaction execution. First, a Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, a Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31% performance improvement in simulation tasks covering a broad range of object types, alongside a 36.7% gain across four real-world tasks with different interaction types.
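The abstract does not spell out how the dual-stream cross-attention is built, so the following is only a minimal sketch of the general idea, under stated assumptions: two token streams (here called geometric and part features) each attend over the other, and the attended context is added back as a residual. The function names, the residual fusion, and the absence of learned query/key/value projections are illustrative simplifications, not the paper's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over another stream.

    Learned projection matrices are omitted here (an assumption made to
    keep the sketch short); a real model would project q, k, and v first.
    """
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        attended.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return attended


def dual_stream_fuse(geom_tokens, part_tokens):
    """Hypothetical fusion step: each stream queries the other, plus a residual."""
    geom_ctx = cross_attention(geom_tokens, part_tokens, part_tokens)
    part_ctx = cross_attention(part_tokens, geom_tokens, geom_tokens)
    geom_out = [[a + b for a, b in zip(t, c)] for t, c in zip(geom_tokens, geom_ctx)]
    part_out = [[a + b for a, b in zip(t, c)] for t, c in zip(part_tokens, part_ctx)]
    return geom_out, part_out


# Toy inputs: three geometric tokens and two part tokens, feature dimension 2.
geom = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
part = [[1.0, 0.0], [0.0, 1.0]]
fused_geom, fused_part = dual_stream_fuse(geom, part)
```

Each stream keeps its own token count after fusion, so a downstream policy head could consume both outputs side by side. How MFMDP actually conditions its diffusion policy on the fused features is not described in the abstract.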
Problem

Research questions and friction points this paper is trying to address.

generalizable manipulation
heterogeneous object interactions
contact point localization
interaction trajectory planning
foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

generalizable manipulation
heterogeneous objects
two-stage framework
foundation models
diffusion policy