Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of physical plausibility in human-object interaction (HOI) trajectories reconstructed from monocular videos, which limits their use as reference motions in humanoid-object simulation. The authors propose HA-HOI, a framework built on a “human-first, object-follow” formulation: human motion is recovered first as the interaction anchor, and the object is then reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is projected into a physics-based humanoid-object simulation, where it serves as a teacher trajectory for stable physical rollout. On both benchmark and in-the-wild videos, the authors report improvements over prior monocular HOI reconstruction methods in human-object alignment, contact consistency, temporal stability, and simulation readiness.
📝 Abstract
Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce $\textbf{HA-HOI}$, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a $\textit{human-first, object-follow}$ formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, $\textbf{HA-HOI}$ improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: https://knoxzhao.github.io/real2sim_in_HOI/
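To make the "human-first, object-follow" idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `object_follows_human`, the use of a single hand position as the anchor, the per-frame contact flags, and the fixed-grasp (constant offset) assumption are all hypothetical simplifications, and rotation is ignored. The sketch only shows how a noisy object track could be re-expressed relative to an already-recovered human trajectory.

```python
from statistics import median


def object_follows_human(hand_positions, object_positions, in_contact):
    """Re-anchor a noisy object track to the hand during contact frames.

    Hypothetical illustration only: assumes positions are (x, y, z) tuples
    and that the object keeps a fixed offset from the hand while in contact.
    """
    # Per-frame offset of the object relative to the human anchor (the hand).
    offsets = [
        tuple(o - h for o, h in zip(obj, hand))
        for hand, obj in zip(hand_positions, object_positions)
    ]
    contact_offsets = [off for off, c in zip(offsets, in_contact) if c]
    if not contact_offsets:
        # No contact frames: nothing to anchor to, keep the original track.
        return [tuple(obj) for obj in object_positions]

    # Robust per-axis estimate of one rigid hand-to-object offset.
    rigid = tuple(median(axis) for axis in zip(*contact_offsets))

    refined = []
    for hand, obj, c in zip(hand_positions, object_positions, in_contact):
        if c:
            # In contact: the object follows the human anchor.
            refined.append(tuple(h + r for h, r in zip(hand, rigid)))
        else:
            # Out of contact: keep the independently estimated position.
            refined.append(tuple(obj))
    return refined


# Example: the hand moves along x; the object estimate has one outlier frame.
hand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
obj = [(0.1, 0.0, 0.0), (1.5, 0.0, 0.0), (2.1, 0.0, 0.0), (9.0, 0.0, 0.0)]
contact = [True, True, True, False]
refined_track = object_follows_human(hand, obj, contact)
```

In this toy example the outlier in the second frame is pulled back to the median hand-relative offset, while the final non-contact frame is left untouched. A pipeline of the kind the abstract describes would additionally handle object orientation and pass the result to a physics simulator as a reference trajectory; neither step is modeled here.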
Problem

Research questions and friction points this paper is trying to address.

human-object interaction
monocular video
physical plausibility
4D reconstruction
simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

physically plausible HOI
human-first object-follow
monocular video reconstruction
physics-based simulation
4D human-object interaction