OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently and robustly learning robotic manipulation skills from human demonstration videos while overcoming background clutter and visually imperceptible object properties. The authors propose an object-centric multimodal learning framework that reconstructs task-relevant 3D point clouds from multi-view RGB videos, integrates large-scale tactile priors, and employs a novel ResFiLM multimodal fusion mechanism to guide a diffusion policy in generating precise actions. For the first time, the approach combines a 3D vision foundation model (VGGT) with million-scale tactile data to enable end-to-end transfer from raw video demonstrations to robust manipulation policies. Experimental results demonstrate that the method significantly outperforms existing approaches in both purely visual and visuo-tactile manipulation tasks, validating its effectiveness and robustness.

📝 Abstract
We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
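The abstract describes fusing 3D and tactile features through a module named ResFiLM before conditioning the diffusion policy. The paper's exact architecture is not given here, so the following is only a minimal sketch of one plausible reading: FiLM-style feature modulation (per-channel scale and shift predicted from the tactile features) combined with a residual connection, using plain NumPy with hypothetical dimensions and randomly initialized weights.

```python
# Hedged sketch of a FiLM-with-residual fusion step ("ResFiLM" is the
# paper's name; the concrete design here is an assumption, not the
# authors' implementation).
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

def res_film_fusion(visual_feat, tactile_feat,
                    w_gamma, b_gamma, w_beta, b_beta):
    """Condition visual features on tactile features via FiLM,
    then add a residual path so vision-only information survives
    even when the tactile signal is uninformative."""
    gamma = linear(tactile_feat, w_gamma, b_gamma)  # per-channel scale
    beta = linear(tactile_feat, w_beta, b_beta)     # per-channel shift
    modulated = gamma * visual_feat + beta          # FiLM modulation
    return visual_feat + modulated                  # residual connection

# Hypothetical feature sizes for illustration only.
d_vis, d_tac = 8, 4
visual = rng.standard_normal(d_vis)
tactile = rng.standard_normal(d_tac)
w_g = rng.standard_normal((d_tac, d_vis)); b_g = np.zeros(d_vis)
w_b = rng.standard_normal((d_tac, d_vis)); b_b = np.zeros(d_vis)

fused = res_film_fusion(visual, tactile, w_g, b_g, w_b, b_b)
```

Note the design property the residual path gives: if the learned modulation collapses to zero (gamma = beta = 0), the fused output reduces exactly to the visual features, so tactile conditioning can only add to, never erase, the visual signal.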
Problem

Research questions and friction points this paper is trying to address.

Human-to-Robot Action Transfer
Object-Centric Learning
3D Perception
Tactile Priors
Robot Manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-Centric Learning
3D-Tactile Fusion
Human-to-Robot Action Transfer
Diffusion Policy
Multimodal Perception
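The framework's final stage feeds the fused features into a Diffusion Policy that generates actions by iterative denoising. The paper's network, noise schedule, and action horizon are not specified in this summary, so the sketch below shows only the generic DDPM-style ancestral sampling loop such a policy runs at inference time, with a placeholder noise predictor standing in for the learned, observation-conditioned network.

```python
# Generic diffusion-policy action sampler (a sketch of the standard
# DDPM reverse process; all dimensions, the schedule, and eps_model
# are illustrative assumptions, not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)

T = 10                                   # number of denoising steps (assumed)
betas = np.linspace(1e-4, 0.2, T)        # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(a_t, t, obs):
    # Placeholder noise predictor. In the real policy this is a learned
    # network conditioned on the fused visuo-tactile observation `obs`.
    return 0.1 * a_t

def sample_action(obs, action_dim=2):
    a = rng.standard_normal(action_dim)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(a, t, obs)
        # Posterior mean of the reverse step given the predicted noise.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (a - coef * eps) / np.sqrt(alphas[t])
        # Add sampling noise at every step except the last.
        noise = rng.standard_normal(action_dim) if t > 0 else 0.0
        a = mean + np.sqrt(betas[t]) * noise
    return a

action = sample_action(obs=None)
```

In practice the conditioning `obs` would be the ResFiLM-fused 3D and tactile features, and `a` would be a short action chunk rather than a single step; those choices are left out here because the summary does not state them.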