MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

📅 2026-03-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for human-object interaction (HOI) video reenactment are largely limited to simple in-plane motion and struggle with complex non-planar manipulations such as out-of-plane object rotation. To address this, the work proposes MVHOI, a two-stage framework that connects multi-view reference conditions to controllable video generation through a 3D foundation model. In the first stage, the 3D foundation model produces view-consistent object priors. In the second, a controllable video generation model synthesizes high-fidelity object texture from multi-view reference images, using a retrieval mechanism to keep appearance consistent. The two stages reinforce each other during inference, which supports long-duration interaction videos. The authors report substantial improvements over prior approaches, especially for interactions involving complex 3D object manipulation.

📝 Abstract

Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.
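
The control flow described in the abstract can be sketched as a toy loop. This is an illustration only, not the authors' implementation: the abstract does not specify how the 3DFM, the video model, or the retrieval mechanism work internally. Here the object pose is reduced to a single yaw angle, retrieval is assumed to be a nearest-view lookup by angular distance, and every name (`ReferenceView`, `estimate_prior`, `retrieve_view`, `render_chunk`, `reenact`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReferenceView:
    """One multi-view reference image (hypothetical structure)."""
    name: str
    yaw_deg: float  # assumed camera azimuth around the object


def angular_gap(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def estimate_prior(prev_yaw_deg, motion_delta_deg):
    """Stand-in for stage 1 (3DFM): advance the object pose by a motion cue."""
    return (prev_yaw_deg + motion_delta_deg) % 360.0


def retrieve_view(views, object_yaw_deg):
    """Assumed retrieval rule: pick the reference view nearest to the pose."""
    return min(views, key=lambda v: angular_gap(v.yaw_deg, object_yaw_deg))


def render_chunk(yaw_deg, view):
    """Stand-in for stage 2 (video model): record what would be generated."""
    return {"yaw_deg": yaw_deg, "texture_from": view.name}


def reenact(views, motion_deltas_deg, start_yaw_deg=0.0):
    """Alternate the two stand-in stages, one video chunk per motion cue."""
    yaw = start_yaw_deg
    chunks = []
    for delta in motion_deltas_deg:
        yaw = estimate_prior(yaw, delta)          # stage 1: object prior
        view = retrieve_view(views, yaw)          # pick appearance reference
        chunks.append(render_chunk(yaw, view))    # stage 2: synthesize chunk
        # The pose carried into the next iteration is the only "feedback"
        # modeled here; the real mutual reinforcement is not specified.
    return chunks


views = [
    ReferenceView("front", 0.0),
    ReferenceView("right", 90.0),
    ReferenceView("back", 180.0),
    ReferenceView("left", 270.0),
]
chunks = reenact(views, [100.0, 100.0, 100.0])
```

In this toy run the object rotates out of plane in three steps, and each chunk draws its texture from a different reference view, which is the situation where a single reference image would be insufficient.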

Problem

Research questions and friction points this paper is trying to address.

Human-Object Interaction

video reenactment

3D object manipulation

multi-view condition

complex motion

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Foundation Model

Multi-view Conditioning

Human-Object Interaction