🤖 AI Summary
This work addresses the challenges of low accuracy, poor generalization, and limited scalability in reconstructing 4D human-object interaction (HOI) motion from monocular internet videos. We propose 4DHOISolver, an optimization framework that jointly leverages monocular visual cues, high-fidelity human pose estimation, physics-based constraints (including contact forces and collision response), and sparsely annotated human-object contact points, ensuring spatiotemporal coherence while significantly improving reconstruction fidelity and cross-scene generalization. We further introduce Open4DHOI, the first large-scale, open-source 4D HOI dataset, comprising 144 object categories, 103 interaction actions, and diverse real-world video scenarios. The reconstructed high-quality 4D motion sequences are empirically validated on downstream applications, including action imitation and policy learning for reinforcement-learning agents.
📝 Abstract
Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless, readily available source of such data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant unsolved challenge. To address it, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem with sparse, human-in-the-loop contact-point annotations while maintaining high spatiotemporal coherence and physical plausibility. Building on this framework, we present Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. We further demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models shows that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at https://wenboran2002.github.io/open4dhoi/.