InCoM: Intent-Driven Perception and Structured Coordination for Whole-Body Mobile Manipulation

📅 2026-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of control optimization in whole-body mobile manipulation arising from strong coupling between the base and arm, as well as suboptimal perceptual attention allocation under dynamic viewpoints. To this end, we propose an intent-driven perception and structured coordination framework that infers latent motion intent to dynamically reweight multi-scale perceptual features. By integrating a geometric-semantic structured alignment mechanism, our approach achieves robust cross-modal perception. Furthermore, we design a decoupled coordinated flow-matching action decoder that effectively mitigates control coupling. Evaluated on three ManiSkill-HAB scenarios, our method improves task success rates by 28.2%, 26.1%, and 23.6%, respectively, significantly outperforming current state-of-the-art approaches.

📝 Abstract
Whole-body mobile manipulation is a fundamental capability for general-purpose robotic agents, requiring both coordinated control of the mobile base and manipulator and robust perception under dynamically changing viewpoints. However, existing approaches face two key challenges: strong coupling between base and arm actions complicates whole-body control optimization, and perceptual attention is often poorly allocated as viewpoints shift during mobile manipulation. We propose InCoM, an intent-driven perception and structured coordination framework for whole-body mobile manipulation. InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, enabling stage-adaptive allocation of perceptual attention. To support robust cross-modal perception, InCoM further incorporates a geometric-semantic structured alignment mechanism that enhances multimodal correspondence. On the control side, we design a decoupled coordinated flow matching action decoder that explicitly models coordinated base-arm action generation, alleviating optimization difficulties caused by control coupling. Without access to privileged perceptual information, InCoM outperforms state-of-the-art methods on three ManiSkill-HAB scenarios by 28.2%, 26.1%, and 23.6% in success rate, demonstrating strong effectiveness for whole-body mobile manipulation.
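The abstract describes inferring latent motion intent and using it to reweight multi-scale perceptual features. As a rough illustration only (the paper's actual architecture is not reproduced here, and the function and variable names below are hypothetical), the reweighting step can be sketched as a softmax-weighted convex combination over per-scale feature vectors:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def intent_weighted_features(intent_logits, multi_scale_feats):
    """Fuse per-scale features according to an inferred intent distribution.

    intent_logits: shape (S,), one score per perceptual scale
        (stand-in for the output of a learned intent-inference head)
    multi_scale_feats: list of S feature vectors, each of shape (D,)
    Returns the (D,) intent-weighted fusion.
    """
    w = softmax(np.asarray(intent_logits, dtype=float))
    feats = np.stack(multi_scale_feats)   # (S, D)
    return w @ feats                      # convex combination over scales

# Toy example: intent scores strongly favor the third (finest) scale,
# so the fused feature is pulled toward that scale's vector.
fused = intent_weighted_features(
    [0.1, 0.2, 3.0],
    [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)],
)
```

In a full pipeline the intent logits would come from a learned module conditioned on robot state and observations, and the fusion would feed the policy; this sketch only shows the reweighting mechanism itself.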
Problem

Research questions and friction points this paper is trying to address.

whole-body mobile manipulation
control coupling
perceptual attention
dynamic viewpoint
mobile manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

intent-driven perception
structured coordination
whole-body mobile manipulation
decoupled action decoding
geometric-semantic alignment
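The decoupled action decoding listed above is built on flow matching, which generates an action by integrating a learned velocity field from a noise sample to a target action. As a minimal sketch of that generation step only (the velocity field here is a closed-form stand-in, not a trained network, and the decoupled base-arm structure is not modeled):

```python
import numpy as np

def euler_flow_decode(velocity_fn, a0, steps=50):
    """Integrate the flow ODE da/dt = v(a, t) from t=0 to t=1 with Euler steps.

    velocity_fn: callable (a, t) -> da/dt; in a real decoder this is learned
    a0: initial noise sample, shape (D,)
    Returns the decoded action at t = 1.
    """
    a, dt = a0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        a = a + dt * velocity_fn(a, t)
    return a

# Stand-in velocity field that pulls any sample toward a fixed target action
# (a hypothetical concatenated [base; arm] command).
target = np.array([0.5, -0.2, 0.0, 0.1])
v = lambda a, t: target - a

decoded = euler_flow_decode(v, np.zeros(4), steps=50)
```

For this linear field the exact flow is `a(t) = target * (1 - exp(-t))`, so the Euler result should land near `target * (1 - 1/e)`; a trained decoder would instead learn a field whose endpoint distribution matches the demonstrated actions.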
Jiahao Liu
Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China
Wenbo Cui
Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Haoran Li
Institute of Automation, Chinese Academy of Sciences
Artificial Intelligence, Robotics, Reinforcement Learning, Embodied Intelligence
Dongbin Zhao
Institute of Automation, Chinese Academy of Sciences
Deep Reinforcement Learning, Adaptive Dynamic Programming, Game AI, Smart Driving, Robotics