InCoM: Intent-Driven Perception and Structured Coordination for Whole-Body Mobile Manipulation

📅 2026-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of control optimization in whole-body mobile manipulation arising from strong coupling between the base and arm, as well as suboptimal perceptual attention allocation under dynamic viewpoints. To this end, we propose an intent-driven perception and structured coordination framework that infers latent motion intent to dynamically reweight multi-scale perceptual features. By integrating a geometric-semantic structured alignment mechanism, our approach achieves robust cross-modal perception. Furthermore, we design a decoupled coordinated flow-matching action decoder that effectively mitigates control coupling. Evaluated on three ManiSkill-HAB scenarios, our method improves task success rates by 28.2%, 26.1%, and 23.6%, respectively, significantly outperforming current state-of-the-art approaches.

📝 Abstract
Whole-body mobile manipulation is a fundamental capability for general-purpose robotic agents, requiring both coordinated control of the mobile base and manipulator and robust perception under dynamically changing viewpoints. However, existing approaches face two key challenges: strong coupling between base and arm actions complicates whole-body control optimization, and perceptual attention is often poorly allocated as viewpoints shift during mobile manipulation. We propose InCoM, an intent-driven perception and structured coordination framework for whole-body mobile manipulation. InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, enabling stage-adaptive allocation of perceptual attention. To support robust cross-modal perception, InCoM further incorporates a geometric-semantic structured alignment mechanism that enhances multimodal correspondence. On the control side, we design a decoupled coordinated flow matching action decoder that explicitly models coordinated base-arm action generation, alleviating optimization difficulties caused by control coupling. Without access to privileged perceptual information, InCoM outperforms state-of-the-art methods on three ManiSkill-HAB scenarios by 28.2%, 26.1%, and 23.6% in success rate, demonstrating strong effectiveness for whole-body mobile manipulation.
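The abstract describes inferring latent motion intent and using it to reweight multi-scale perceptual features. As a rough illustration only (the paper's actual architecture is not reproduced here, and the function and variable names below are hypothetical), the reweighting step can be sketched as a softmax-weighted convex combination over per-scale feature vectors:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def intent_weighted_features(intent_logits, multi_scale_feats):
    """Fuse per-scale features according to an inferred intent distribution.

    intent_logits: shape (S,), one score per perceptual scale
        (stand-in for the output of a learned intent-inference head)
    multi_scale_feats: list of S feature vectors, each of shape (D,)
    Returns the (D,) intent-weighted fusion.
    """
    w = softmax(np.asarray(intent_logits, dtype=float))
    feats = np.stack(multi_scale_feats)   # (S, D)
    return w @ feats                      # convex combination over scales

# Toy example: intent scores strongly favor the third (finest) scale,
# so the fused feature is pulled toward that scale's vector.
fused = intent_weighted_features(
    [0.1, 0.2, 3.0],
    [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)],
)
```

In a full pipeline the intent logits would come from a learned module conditioned on robot state and observations, and the fusion would feed the policy; this sketch only shows the reweighting mechanism itself.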
Problem

Research questions and friction points this paper is trying to address.

whole-body mobile manipulation
control coupling
perceptual attention
dynamic viewpoint
mobile manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

intent-driven perception
structured coordination
whole-body mobile manipulation
decoupled action decoding
geometric-semantic alignment
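The decoupled action decoding listed above is built on flow matching, which generates an action by integrating a learned velocity field from a noise sample to a target action. As a minimal sketch of that generation step only (the velocity field here is a closed-form stand-in, not a trained network, and the decoupled base-arm structure is not modeled):

```python
import numpy as np

def euler_flow_decode(velocity_fn, a0, steps=50):
    """Integrate the flow ODE da/dt = v(a, t) from t=0 to t=1 with Euler steps.

    velocity_fn: callable (a, t) -> da/dt; in a real decoder this is learned
    a0: initial noise sample, shape (D,)
    Returns the decoded action at t = 1.
    """
    a, dt = a0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        a = a + dt * velocity_fn(a, t)
    return a

# Stand-in velocity field that pulls any sample toward a fixed target action
# (a hypothetical concatenated [base; arm] command).
target = np.array([0.5, -0.2, 0.0, 0.1])
v = lambda a, t: target - a

decoded = euler_flow_decode(v, np.zeros(4), steps=50)
```

For this linear field the exact flow is `a(t) = target * (1 - exp(-t))`, so the Euler result should land near `target * (1 - 1/e)`; a trained decoder would instead learn a field whose endpoint distribution matches the demonstrated actions.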
Jiahao Liu
Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China
Wenbo Cui
Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Haoran Li
Institute of Automation, Chinese Academy of Sciences
Artificial Intelligence, Robotics, Reinforcement Learning, Embodied Intelligence
Dongbin Zhao
Institute of Automation, Chinese Academy of Sciences
Deep Reinforcement Learning, Adaptive Dynamic Programming, Game AI, Smart Driving, Robotics