🤖 AI Summary
This work addresses the limited cross-embodiment generalization of foundation robot policies across humanoid platforms. We propose a modality-augmented fine-tuning framework that enriches perception inputs by fusing contact state (represented as both binary contact signals and real-valued contact forces) with metric-scale depth maps generated by ZoeDepth. For the Unitree G1, the framework additionally draws on a multi-modal dataset built with cuRobo motion planning and inverse kinematics, including ground-truth contact-force measurements. Unlike standard fine-tuning or zero-shot transfer, our approach combines lightweight post-processing of existing data with high-quality multi-modal data collection to improve morphological adaptability. Experimental results demonstrate significant gains in cross-platform task success: from 51% to 63% on GR1, and 94% on G1 for the "Pick Apple to Bowl" task. These improvements validate joint contact-depth modeling as an effective means of improving generalization across diverse humanoid embodiments.
📝 Abstract
This paper presents a modality-augmented fine-tuning framework designed to adapt foundation robot policies to diverse humanoid embodiments. We validate our approach across two distinct settings: (i) the GR1 embodiment, utilizing public datasets where we introduce post-processed modalities, including binary contact signals and ZoeDepth-generated metric depth; and (ii) the Unitree G1 embodiment, for which we contribute a novel multi-modal dataset incorporating cuRobo motion planning, inverse kinematics, and ground-truth contact-force measurements. Our experiments demonstrate that modality augmentation consistently enhances policy performance across different embodiments. Specifically, for the GR1, integrating contact-state cues and RGB-D fusion improves online success rates from 51% to 63%. Furthermore, in the G1 "Pick Apple to Bowl" task, our contact-augmented model achieves a success rate of 94%, significantly outperforming the 48% of standard fine-tuning and the 0% success rate of zero-shot transfer. These results highlight that lightweight post-processing effectively strengthens policies for GR1, while high-quality multi-modal data is crucial for reliable transfer to the Unitree G1. Overall, this work establishes a unified, data-centric pathway for extending foundation robot policies through targeted modality design and multi-modal fine-tuning.
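To make the modality-augmentation idea concrete, the sketch below shows one plausible way to assemble an augmented observation: RGB stacked with a ZoeDepth-style metric depth channel, plus a contact vector combining binary contact flags and real-valued forces. This is an illustrative NumPy sketch, not the paper's actual preprocessing code; the function name, tensor layouts, the 2 m depth clipping range, and the log-scaling of forces are all assumptions made for the example.

```python
import numpy as np

def augment_observation(rgb, depth, contact_binary, contact_force):
    """Fuse RGB, metric depth, and contact state into one policy input.

    rgb:            (H, W, 3) uint8 camera image
    depth:          (H, W) float32 metric depth in meters (e.g. from ZoeDepth)
    contact_binary: (F,) array of {0, 1} contact flags   [hypothetical layout]
    contact_force:  (F,) float32 contact forces, newtons [hypothetical layout]

    Returns a dict with an (H, W, 4) RGB-D tensor and a (2F,) contact vector.
    """
    # Normalize RGB to [0, 1]; clip depth to an assumed 2 m workspace, then scale.
    rgb_n = rgb.astype(np.float32) / 255.0
    depth_n = np.clip(depth, 0.0, 2.0) / 2.0
    visual = np.concatenate([rgb_n, depth_n[..., None]], axis=-1)  # RGB-D fusion

    # Contact state: binary flags plus log-scaled forces, stacked into one vector.
    force_n = np.log1p(np.maximum(contact_force.astype(np.float32), 0.0))
    contact = np.concatenate([contact_binary.astype(np.float32), force_n])
    return {"visual": visual, "contact": contact}
```

In this sketch the depth channel is appended to RGB for early fusion, while the low-dimensional contact vector would be fed to the policy separately (e.g. concatenated with proprioception); the actual fusion point in the paper's architecture may differ.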