RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current data-driven humanoid robotics approaches rely heavily on large-scale annotated datasets, generalize poorly, and lack geometric reasoning for unseen scenarios. To address these limitations, we propose RGMP, a unified end-to-end framework integrating geometric-semantic skill reasoning with data-efficient visuomotor control. RGMP introduces a geometric-prior-guided skill selector for few-shot generation of skill sequences; an adaptive recursive Gaussian network that fuses vision-language models with geometric inductive biases to model multi-scale spatial relationships; and a hierarchical probabilistic motion representation coupled with multimodal policy learning. Evaluated on a humanoid robot and a desktop dual-arm platform, RGMP achieves an 87% task success rate, attains fivefold higher data efficiency than state-of-the-art methods, and demonstrates significantly improved cross-scenario generalization, particularly in geometrically novel environments.

📝 Abstract
Humanoid robots exhibit significant potential in executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that necessitate extensive training datasets to achieve robust multimodal decision-making capabilities and generalizable visuomotor control. These methods neglect geometric reasoning in unseen scenarios and model robot-target relationships inefficiently within the training data, resulting in significant waste of training resources. To address these limitations, we present the Recurrent Geometric-prior Multimodal Policy (RGMP), an end-to-end framework that unifies geometric-semantic skill reasoning with data-efficient visuomotor control. For perception, we propose the Geometric-prior Skill Selector, which infuses geometric inductive biases into a vision-language model, producing adaptive skill sequences for unseen scenes with minimal spatial common-sense tuning. To achieve data-efficient robotic motion synthesis, we introduce the Adaptive Recursive Gaussian Network, which parameterizes robot-object interactions as a compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships, yielding dexterous, data-efficient motion synthesis even from sparse demonstrations. Evaluated on both our humanoid robot and a desktop dual-arm robot, the RGMP framework achieves 87% task success in generalization tests and exhibits 5x greater data efficiency than the state-of-the-art model. This performance underscores its superior cross-domain generalization, enabled by geometric-semantic reasoning and recursive-Gaussian adaptation.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient multimodal decision-making in humanoid robot manipulation
Solves geometric reasoning limitations in unseen robotic scenarios
Improves data efficiency for generalizable visuomotor control
Innovation

Methods, ideas, or system contributions that make the work stand out.

RGMP integrates geometric reasoning with visuomotor control
Geometric-prior Skill Selector adapts skills for unseen scenes
Adaptive Recursive Gaussian Network encodes robot-object interactions
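The summary does not include the equations of the Adaptive Recursive Gaussian Network, but the stated idea of a "compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships" can be sketched, under assumptions, as recursive precision-weighted fusion of per-scale Gaussian beliefs about a robot-object offset. Everything below (the function names, the 1-D simplification, the example values) is illustrative, not the paper's actual network:

```python
# Hypothetical coarse-to-fine Gaussian fusion: each scale contributes a
# Gaussian belief over a robot-object offset, and finer scales recursively
# refine the coarser estimate via precision-weighted (product-of-Gaussians)
# updates. This is a sketch of the multi-scale idea, not the ARGN itself.

def fuse_gaussians(mu_a, var_a, mu_b, var_b):
    """Product of two 1-D Gaussians, returned as (mean, variance)."""
    precision = 1.0 / var_a + 1.0 / var_b
    var = 1.0 / precision
    mu = var * (mu_a / var_a + mu_b / var_b)
    return mu, var

def recursive_fuse(levels):
    """Fold a coarse-to-fine list of (mean, variance) pairs into one belief."""
    mu, var = levels[0]
    for m, v in levels[1:]:
        mu, var = fuse_gaussians(mu, var, m, v)
    return mu, var

# A coarse scene-level estimate, a mid-level object estimate, and a fine
# grasp-point estimate; each fusion step tightens the variance.
estimate = recursive_fuse([(0.0, 1.0), (1.0, 1.0), (0.5, 0.25)])
```

The recursive structure is what makes the representation compact: each scale only stores its own Gaussian, and the fused belief is rebuilt on the fly, with variance shrinking monotonically as finer scales are folded in.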
Authors

Xuetao Li — School of Computer Science, Wuhan University
Wenke Huang — School of Computer Science, Wuhan University (Federated Learning, MLLM)
Nengyuan Pan — Faculty of Artificial Intelligence, Hubei University
Kaiyan Zhao — The University of Tokyo (Natural Language Processing)
Songhua Yang — School of Computer Science, Wuhan University
Yiming Wang — State Key Laboratory of Internet of Things for Smart City, University of Macau
Mengde Li — Institute of Technological Sciences, Wuhan University
Mang Ye — Professor, Wuhan University (Multimodal Learning, Person Re-identification, Federated Learning)
Jifeng Xuan — Wuhan University (Software Engineering, Testing, Debugging, Mining Software Repositories, SBSE)
Miao Li — Institute of Technological Sciences, Wuhan University; School of Robotics, Wuhan University