CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

When generating videos, existing diffusion models struggle to keep sensitive regions such as hands and faces structurally consistent and to keep human-object interactions physically plausible (e.g., avoiding interpenetration). To address this, the work proposes an end-to-end Diffusion Transformer framework conditioned on a human image, an object image, text prompts, and speech audio. The architecture pairs a dual-stream RGB-structure co-generation design with a mixture-of-experts mechanism whose routing is spatially supervised, enabling fine-grained regional modeling. Interaction-aware geometric priors are injected during training through the auxiliary structure stream, which is removed at inference and therefore adds no computational overhead. The authors report that the method significantly outperforms existing approaches in structural stability, logical coherence, and interaction realism.
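
The routing idea in the summary can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the region names, the three-expert layout, the linear router, the top-1 dispatch, and the cross-entropy form of the routing loss are all assumptions made here for illustration. The source only states that tokens are routed to region-specialized experts and that the routing is spatially supervised.

```python
import math

# Hypothetical region labels; the source names hands and faces as sensitive regions.
REGIONS = ["background", "hand", "face"]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token, router_weights):
    """Routing probabilities over experts for one token (assumed linear router)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    return softmax(logits)

def routing_loss(probs_per_token, region_labels):
    """Mean cross-entropy between router probabilities and per-token region labels."""
    losses = [-math.log(p[label]) for p, label in zip(probs_per_token, region_labels)]
    return sum(losses) / len(losses)

def dispatch(probs_per_token):
    """Top-1 dispatch: index of the most probable expert for each token."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs_per_token]

# Toy data: three 2-D tokens; in practice labels would come from region masks.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
labels = [1, 2, 0]  # hand, face, background
router_weights = [[-1.0, -1.0], [2.0, 0.0], [0.0, 2.0]]  # one row per expert

probs = [route(t, router_weights) for t in tokens]
assignments = dispatch(probs)
loss = routing_loss(probs, labels)
```

With these toy weights each token is dispatched to the expert matching its region label, and the supervision term is small. In a real model the router would be trained so that this loss falls alongside the generation objective.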

📝 Abstract

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
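
One way to read the last design point is as a one-directional attention pattern. The sketch below rests on an assumption the abstract does not spell out: that RGB tokens never read HOI tokens, so dropping the HOI branch leaves the RGB outputs unchanged. The function names, the toy token values, and the use of identity query/key/value projections are hypothetical simplifications.

```python
import math

def attention(queries, keys, values, allowed):
    """Single-head dot-product attention; allowed[i][j] says whether query i may see key j."""
    out = []
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    for i, q in enumerate(queries):
        idx = [j for j in range(len(keys)) if allowed[i][j]]
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * values[j][d] for w, j in zip(weights, idx)) / total
                    for d in range(dim)])
    return out

def co_generation_mask(n_rgb, n_hoi):
    """Assumed mask: RGB queries see only RGB keys; HOI queries see RGB and HOI keys."""
    n = n_rgb + n_hoi
    return [[(j < n_rgb) or (i >= n_rgb) for j in range(n)] for i in range(n)]

# Toy tokens (identity projections stand in for learned Q/K/V).
rgb = [[0.2, 0.1], [0.5, -0.3], [-0.4, 0.9]]
hoi = [[1.0, 1.0], [-2.0, 0.5]]

# Training-style pass: both streams share one attention call.
joint = rgb + hoi
mask = co_generation_mask(len(rgb), len(hoi))
train_out = attention(joint, joint, joint, mask)

# Inference-style pass: the HOI tokens are simply absent.
full = [[True] * len(rgb) for _ in rgb]
infer_out = attention(rgb, rgb, rgb, full)
```

Under this mask the first three rows of `train_out` match `infer_out`, which is what makes removal free at inference. In training, the HOI stream would still influence the RGB stream indirectly, through gradients on the shared weights.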

Problem

Research questions and friction points this paper is trying to address.

human-object interaction

video synthesis

structural stability

physical plausibility

diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-Object Interaction

Diffusion Transformer

Mixture-of-Experts