Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

📅 2026-05-21

🤖 AI Summary
This work addresses whole-body humanoid-object interaction (HOI), which is held back by the scarcity of high-fidelity 3D data and by two limitations of existing video-prior approaches: representation misalignment, caused by reliance on geometric priors such as explicit CAD models, and complex, error-prone retargeting pipelines. The authors propose Imagine2Real, a zero-shot HOI framework that models both robot and object motion as unified 4D point trajectories, a geometry-agnostic representation that does not depend on explicit CAD models. Its Keypoints Tracker follows only sparse critical points on the base, hands, and object, which bypasses motion retargeting entirely. To keep gaits natural despite these sparse signals, the tracker searches within the latent space of a Behavior Foundation Model (BFM) and is trained progressively with simple tracking-based rewards. The resulting policy is deployed zero-shot on a physical humanoid within a motion capture (mocap) system.
📝 Abstract
Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from Representation Misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.
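To make the idea of a "simple tracking reward" over sparse 4D point trajectories concrete, here is a minimal sketch. It is not the paper's implementation: the keypoint names, the exponential reward shape, and the `sigma` scale are all assumptions chosen for illustration. It stores each keypoint as a list of 3D positions over time (3D plus time gives the 4D trajectory) and scores how closely the achieved points follow the reference at one timestep.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
# One trajectory per named keypoint: a list of 3D positions indexed by timestep.
Trajectory = Dict[str, List[Point]]

# Hypothetical keypoint names; the abstract only says "base, hands, and object".
KEYPOINTS = ("base", "left_hand", "right_hand", "object")


def tracking_reward(reference: Trajectory, achieved: Trajectory,
                    t: int, sigma: float = 0.25) -> float:
    """Illustrative reward in (0, 1]: 1.0 when every keypoint matches at step t.

    Assumed form: exp(-mean squared keypoint distance / sigma^2).
    """
    sq_err = 0.0
    for name in KEYPOINTS:
        sq_err += math.dist(reference[name][t], achieved[name][t]) ** 2
    mean_sq_err = sq_err / len(KEYPOINTS)
    return math.exp(-mean_sq_err / sigma ** 2)


# A one-step reference, e.g. keypoints lifted from a generated video.
reference: Trajectory = {
    "base": [(0.0, 0.0, 0.9)],
    "left_hand": [(0.3, 0.2, 1.0)],
    "right_hand": [(0.3, -0.2, 1.0)],
    "object": [(0.5, 0.0, 1.0)],
}

# The achieved motion matches everywhere except the object, which is 0.25 m off in x.
achieved: Trajectory = {name: list(points) for name, points in reference.items()}
achieved["object"] = [(0.75, 0.0, 1.0)]

perfect = tracking_reward(reference, reference, t=0)
shifted = tracking_reward(reference, achieved, t=0)
```

Because the reward depends only on point positions, the same function applies to robot and object keypoints alike, with no CAD model or joint-level retargeting involved. In this sketch a perfect match scores 1.0 and the score decays smoothly as any keypoint drifts from its reference.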
Problem

Research questions and friction points this paper is trying to address.

Humanoid-Object Interaction
Representation Misalignment
Retargeting Complexity
Zero-shot Learning
3D Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot Humanoid-Object Interaction
4D Point Trajectories
Geometry-free Interaction
Behavior Foundation Model
Keypoints Tracker