AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing methods for language-instructed mobile manipulation in domestic service robotics struggle to coordinate the mobile base and robotic arm. They neither explicitly model the dynamic influence of base motion on arm control, which leads to error accumulation in high-dimensional action spaces, nor adaptively leverage multimodal sensory inputs, instead relying on static, unimodal 2D visual observations and ignoring stage-dependent perceptual requirements for 2D images and 3D point clouds. Method: We propose an end-to-end diffusion Transformer framework featuring two key innovations: (i) a mobility-to-body conditioning mechanism that explicitly models base-body coupling dynamics, and (ii) a perception-adaptive multimodal fusion strategy that dynamically modulates the weights of 2D and 3D features across task phases. Contribution/Results: Extensive experiments in simulation and on real-world robotic platforms demonstrate significant improvements in task success rate and trajectory accuracy, alongside strong robustness and practical deployability.

📝 Abstract
Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating the mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages of mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as a context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks.
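The two-stage structure of the mobility-to-body conditioning mechanism (extract a base-motion representation first, then use it as a context prior for whole-body action prediction) can be sketched numerically. This is a minimal illustration, not the paper's diffusion Transformer: the class name, the random linear maps `W_base` and `W_body`, and all dimensions are hypothetical stand-ins for learned networks.

```python
import numpy as np

class MobilityToBodyConditioning:
    """Hypothetical sketch of two-stage conditioning: a base-motion latent
    is extracted first, then concatenated with the observation as a context
    prior for whole-body action prediction. Random weights stand in for
    learned networks."""

    def __init__(self, obs_dim, latent_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_base = rng.normal(scale=0.1, size=(latent_dim, obs_dim))
        self.W_body = rng.normal(scale=0.1, size=(action_dim, obs_dim + latent_dim))

    def base_latent(self, obs):
        # Stage 1: base-motion representation from the observation
        return np.tanh(self.W_base @ obs)

    def predict_action(self, obs):
        # Stage 2: whole-body (base + arm) action conditioned on the latent
        ctx = np.concatenate([obs, self.base_latent(obs)])
        return self.W_body @ ctx

model = MobilityToBodyConditioning(obs_dim=16, latent_dim=4, action_dim=9)
obs = np.zeros(16)                      # placeholder observation
action = model.predict_action(obs)      # 9-dim whole-body action vector
```

The point of the structure is that errors in predicted arm motion can be corrected for the base's motion, because the base latent is available before the whole-body action is decoded.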
Problem

Research questions and friction points this paper is trying to address.

Coordinating mobile base and manipulator in robotic control
Modeling influence of mobile base on manipulator actions
Adapting multimodal perception for different manipulation stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mobility-to-body conditioning for coordination
Perception-aware multimodal fusion strategy
Adaptive 2D and 3D feature weighting
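The adaptive 2D/3D feature weighting above can be sketched as a softmax gate over the two modalities. This is a hedged illustration under assumed names: `W_gate` and the stage embedding are hypothetical stand-ins for the paper's learned perception-aware conditioning, not its actual architecture.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_fuse(feat_2d, feat_3d, stage_embedding, W_gate):
    """Blend 2D semantic and 3D geometric features with stage-dependent
    weights. W_gate is a placeholder for a learned gating network that
    maps the current task-stage embedding to two fusion logits."""
    alpha = softmax(W_gate @ stage_embedding)    # two weights summing to 1
    fused = alpha[0] * feat_2d + alpha[1] * feat_3d
    return fused, alpha

rng = np.random.default_rng(0)
feat_2d = rng.normal(size=8)        # e.g. image-encoder features
feat_3d = rng.normal(size=8)        # e.g. point-cloud-encoder features
stage_emb = rng.normal(size=4)      # encodes the current task phase
W_gate = rng.normal(size=(2, 4))
fused, alpha = adaptive_fuse(feat_2d, feat_3d, stage_emb, W_gate)
```

A stage embedding indicating a semantics-heavy phase would push `alpha[0]` toward 1 (more 2D), while a fine-positioning phase would push `alpha[1]` up (more 3D geometry).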
Sixiang Chen
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Jiaming Liu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Siyuan Qian
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Han Jiang
Johns Hopkins University
Natural Language Generation, Societal AI, Model Evaluation
Lily Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Renrui Zhang
Seed ByteDance & MMLab & PKU
Large Multimodal Model, Generative Model, Embodied AI
Zhuoyang Liu
Peking University
Embodied AI, Computer Vision
Chenyang Gu
Undergraduate, Peking University
Embodied AI, Robotic Manipulation
Chengkai Hou
Peking University
Robot
Pengwei Wang
University of Calgary
Computer Science, Security
Zhongyuan Wang
Beijing Academy of Artificial Intelligence (BAAI)
Shanghang Zhang
Peking University
Embodied AI, Foundation Models