RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current data-driven humanoid robotics approaches rely heavily on large-scale annotated datasets, generalize poorly, and lack geometric reasoning for unseen scenarios. To address these limitations, we propose RGMP, a unified end-to-end framework integrating geometric-semantic skill reasoning with data-efficient visuomotor control. RGMP introduces a geometric-prior-guided skill selector for few-shot generation of skill sequences; an adaptive recursive Gaussian network that fuses vision-language models with geometric inductive biases to model multi-scale spatial relationships; and a hierarchical probabilistic motion representation coupled with multimodal policy learning. Evaluated on a humanoid robot and a desktop dual-arm platform, RGMP achieves an 87% task success rate, attains fivefold higher data efficiency than state-of-the-art methods, and demonstrates significantly improved cross-scenario generalization, particularly in geometrically novel environments.

📝 Abstract
Humanoid robots exhibit significant potential in executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that necessitate extensive training datasets to achieve robust multimodal decision-making capabilities and generalizable visuomotor control. These methods neglect geometric reasoning in unseen scenarios and model robot-target relationships inefficiently within the training data, resulting in significant waste of training resources. To address these limitations, we present the Recurrent Geometric-prior Multimodal Policy (RGMP), an end-to-end framework that unifies geometric-semantic skill reasoning with data-efficient visuomotor control. For perception, we propose the Geometric-prior Skill Selector, which infuses geometric inductive biases into a vision-language model, producing adaptive skill sequences for unseen scenes with minimal spatial common-sense tuning. To achieve data-efficient robotic motion synthesis, we introduce the Adaptive Recursive Gaussian Network, which parameterizes robot-object interactions as a compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships, yielding dexterous, data-efficient motion synthesis even from sparse demonstrations. Evaluated on both our humanoid robot and a desktop dual-arm robot, the RGMP framework achieves 87% task success in generalization tests and exhibits 5x greater data efficiency than the state-of-the-art model. This performance underscores its superior cross-domain generalization, enabled by geometric-semantic reasoning and recursive-Gaussian adaptation.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient multimodal decision-making in humanoid robot manipulation
Solves geometric reasoning limitations in unseen robotic scenarios
Improves data efficiency for generalizable visuomotor control
Innovation

Methods, ideas, or system contributions that make the work stand out.

RGMP integrates geometric reasoning with visuomotor control
Geometric-prior Skill Selector adapts skills for unseen scenes
Adaptive Recursive Gaussian Network encodes robot-object interactions
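The summary does not include the equations of the Adaptive Recursive Gaussian Network, but the stated idea of a "compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships" can be sketched, under assumptions, as recursive precision-weighted fusion of per-scale Gaussian beliefs about a robot-object offset. Everything below (the function names, the 1-D simplification, the example values) is illustrative, not the paper's actual network:

```python
# Hypothetical coarse-to-fine Gaussian fusion: each scale contributes a
# Gaussian belief over a robot-object offset, and finer scales recursively
# refine the coarser estimate via precision-weighted (product-of-Gaussians)
# updates. This is a sketch of the multi-scale idea, not the ARGN itself.

def fuse_gaussians(mu_a, var_a, mu_b, var_b):
    """Product of two 1-D Gaussians, returned as (mean, variance)."""
    precision = 1.0 / var_a + 1.0 / var_b
    var = 1.0 / precision
    mu = var * (mu_a / var_a + mu_b / var_b)
    return mu, var

def recursive_fuse(levels):
    """Fold a coarse-to-fine list of (mean, variance) pairs into one belief."""
    mu, var = levels[0]
    for m, v in levels[1:]:
        mu, var = fuse_gaussians(mu, var, m, v)
    return mu, var

# A coarse scene-level estimate, a mid-level object estimate, and a fine
# grasp-point estimate; each fusion step tightens the variance.
estimate = recursive_fuse([(0.0, 1.0), (1.0, 1.0), (0.5, 0.25)])
```

The recursive structure is what makes the representation compact: each scale only stores its own Gaussian, and the fused belief is rebuilt on the fly, with variance shrinking monotonically as finer scales are folded in.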
Authors

Xuetao Li — School of Computer Science, Wuhan University
Wenke Huang — School of Computer Science, Wuhan University (Federated Learning, MLLM)
Nengyuan Pan — Faculty of Artificial Intelligence, Hubei University
Kaiyan Zhao — The University of Tokyo (Natural Language Processing)
Songhua Yang — School of Computer Science, Wuhan University
Yiming Wang — State Key Laboratory of Internet of Things for Smart City, University of Macau
Mengde Li — Institute of Technological Sciences, Wuhan University
Mang Ye — Professor, Wuhan University (Multimodal Learning, Person Re-identification, Federated Learning)
Jifeng Xuan — Wuhan University (Software Engineering, Testing, Debugging, Mining Software Repositories, SBSE)
Miao Li — Institute of Technological Sciences, Wuhan University; School of Robotics, Wuhan University