OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

📅 2026-04-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing approaches struggle to integrate multiple conditioning modalities (text, reference images, audio, and pose) at once when generating high-quality human-object interaction videos. This work proposes OmniShow, an end-to-end multimodal video generation framework that combines these heterogeneous signals through Unified Channel-wise Conditioning for image and pose injection and Gated Local-Context Attention for audio-visual synchronization. It also introduces a Decoupled-Then-Joint Training strategy, which uses multi-stage training with model merging to exploit heterogeneous sub-task datasets, and establishes HOIVG-Bench, a dedicated benchmark for human-object interaction video generation. The authors report that OmniShow achieves overall state-of-the-art performance across a range of multimodal conditioning settings.

📝 Abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
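
The abstract names Unified Channel-wise Conditioning for image and pose injection but does not spell out the mechanism. One common reading of "channel-wise" conditioning is to stack the condition latents onto the video latent along the channel axis before the denoiser sees it. The sketch below illustrates only that general idea; the function name, tensor shapes, and the choice to repeat the reference image across time are all assumptions for illustration, not details from the paper.

```python
import numpy as np


def channel_concat_condition(noisy_latent, ref_image_latent, pose_latent):
    """Stack conditions onto a video latent along the channel axis.

    All shapes are hypothetical:
      noisy_latent:     (C, T, H, W)      video latent being denoised
      ref_image_latent: (C_ref, H, W)     single reference frame
      pose_latent:      (C_pose, T, H, W) per-frame pose maps
    """
    _, t, h, w = noisy_latent.shape
    c_ref = ref_image_latent.shape[0]
    # Repeat the reference image across time so it lines up with every frame.
    ref = np.broadcast_to(ref_image_latent[:, None], (c_ref, t, h, w))
    # Channel-wise injection: the denoiser input simply gets more channels.
    return np.concatenate([noisy_latent, ref, pose_latent], axis=0)


x = channel_concat_condition(
    np.zeros((4, 8, 16, 16)),  # video latent
    np.ones((4, 16, 16)),      # reference image latent
    np.zeros((2, 8, 16, 16)),  # pose latent
)
print(x.shape)  # (10, 8, 16, 16)
```

Under this assumption the conditions add input channels rather than extra tokens, so the sequence length the attention layers operate on stays unchanged.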

Problem

Research questions and friction points this paper is trying to address.

Human-Object Interaction
Video Generation
Multimodal Conditioning
Content Creation
HOIVG

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Channel-wise Conditioning
Gated Local-Context Attention
Decoupled-Then-Joint Training
HOIVG-Bench
Multimodal Video Generation
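
The abstract says Decoupled-Then-Joint Training uses multi-stage training with model merging, but it does not state the merging rule. A minimal sketch of one widely used option, linear interpolation of parameters across checkpoints trained on different sub-tasks, is given below. The function `merge_checkpoints`, the checkpoint names, and the equal default weights are hypothetical and are not claimed to match the paper's procedure.

```python
import numpy as np


def merge_checkpoints(state_dicts, weights=None):
    """Linearly interpolate parameters of checkpoints sharing one architecture.

    state_dicts: list of {parameter_name: np.ndarray} (hypothetical format)
    weights:     optional mixing coefficients; defaults to a plain average
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("checkpoints must have identical parameter names")
    # Weighted sum of each parameter tensor across all checkpoints.
    return {
        k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
        for k in keys
    }


# Two toy "sub-task" checkpoints, e.g. one audio-driven and one pose-driven.
audio_ckpt = {"proj.weight": np.array([0.0, 2.0])}
pose_ckpt = {"proj.weight": np.array([2.0, 4.0])}
merged = merge_checkpoints([audio_ckpt, pose_ckpt])
print(merged["proj.weight"])  # [1. 3.]
```

In a scheme like this, the merged parameters would serve as the starting point for a subsequent joint training stage rather than as a final model.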