OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to integrate multiple conditioning modalities (text, reference images, audio, and pose) at once when generating high-quality human-object interaction videos. To address this, the work proposes OmniShow, an end-to-end multimodal video generation framework that combines these heterogeneous signals through Unified Channel-wise Conditioning, which injects image and pose conditions, and Gated Local-Context Attention, which targets audio-visual synchronization. It also introduces a Decoupled-Then-Joint Training strategy that uses multi-stage training with model merging to exploit heterogeneous sub-task datasets, and establishes HOIVG-Bench, a dedicated benchmark for evaluating human-object interaction video generation. The reported experiments show OmniShow achieving overall state-of-the-art performance across various multimodal conditioning settings, with improvements in video quality, controllability, and audio-visual synchronization.
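The channel-wise conditioning idea mentioned in the summary can be illustrated with a small sketch. The summary does not specify the mechanism, so everything below is an assumption: latents are modelled as nested lists shaped `[channels][frames][pixels]`, a missing condition is replaced by zeros, and one presence-mask channel per condition is appended. The names `channel_concat`, `image_channels`, and `pose_channels` are hypothetical and are not taken from OmniShow.

```python
from typing import List, Optional, Tuple

# Assumption: a latent is [channels][frames][pixels]; real systems use tensors.
Latent = List[List[List[float]]]


def shape(x: Latent) -> Tuple[int, int, int]:
    """Return (channels, frames, pixels) of a nested-list latent."""
    return (len(x), len(x[0]), len(x[0][0]))


def constant(channels: int, frames: int, pixels: int, value: float) -> Latent:
    """Build a latent filled with a single value."""
    return [[[value] * pixels for _ in range(frames)] for _ in range(channels)]


def channel_concat(
    noisy: Latent,
    image_cond: Optional[Latent] = None,
    pose_cond: Optional[Latent] = None,
    image_channels: int = 2,  # hypothetical channel count
    pose_channels: int = 1,   # hypothetical channel count
) -> Latent:
    """Stack the noisy latent and the conditions along the channel axis.

    Missing conditions become zeros, so the channel count stays fixed and
    one network input layout can serve every combination of conditions.
    """
    _, frames, pixels = shape(noisy)
    parts = [noisy]
    flags = []
    for cond, n in ((image_cond, image_channels), (pose_cond, pose_channels)):
        if cond is None:
            parts.append(constant(n, frames, pixels, 0.0))
            flags.append(0.0)
        else:
            if shape(cond) != (n, frames, pixels):
                raise ValueError("condition does not match the latent grid")
            parts.append(cond)
            flags.append(1.0)
    # Assumption: one mask channel per condition marks whether it is present.
    for flag in flags:
        parts.append(constant(1, frames, pixels, flag))
    return [channel for part in parts for channel in part]
```

With a 4-channel noisy latent and the default channel counts, the result always has 4 + 2 + 1 + 2 = 9 channels, whether or not the image or pose condition is supplied.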

📝 Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
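The abstract mentions model merging as part of Decoupled-Then-Joint Training but does not state the merging rule. As a rough illustration only, the sketch below uses plain weighted parameter averaging, a common merging technique swapped in here as an assumption. The checkpoint layout (a dict of flat float lists) and the name `merge_checkpoints` are hypothetical.

```python
from typing import Dict, List, Sequence

# Assumption: a checkpoint maps parameter names to flat lists of floats.
StateDict = Dict[str, List[float]]


def merge_checkpoints(
    checkpoints: Sequence[StateDict], weights: Sequence[float]
) -> StateDict:
    """Merge sub-task checkpoints by weighted averaging of their parameters.

    Weights are normalised to sum to one, so equal weights give a plain mean.
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    names = set(checkpoints[0])
    if any(set(ckpt) != names for ckpt in checkpoints):
        raise ValueError("checkpoints must share the same parameter names")

    merged: StateDict = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        acc = [0.0] * size
        for ckpt, w in zip(checkpoints, weights):
            if len(ckpt[name]) != size:
                raise ValueError(f"shape mismatch for parameter {name!r}")
            for i, value in enumerate(ckpt[name]):
                acc[i] += (w / total) * value
        merged[name] = acc
    return merged
```

In a decoupled-then-joint setting, one might train separate models on sub-task datasets (for example, one with audio conditioning and one with pose conditioning), merge them as above, and then continue with joint training from the merged weights. The actual stages used by OmniShow are not detailed in the abstract.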

Problem

Research questions and friction points this paper is trying to address.

Human-Object Interaction

Video Generation

Multimodal Conditioning

Content Creation

HOIVG

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Channel-wise Conditioning

Gated Local-Context Attention

Decoupled-Then-Joint Training
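As a rough illustration of what a gated, locally windowed attention over audio could look like, the sketch below lets each video frame attend only to audio steps within a small temporal window and adds the result through a learnable scalar gate. This is an assumption-based sketch, not the paper's Gated Local-Context Attention: it assumes frame-aligned audio features, uses the audio features directly as keys and values without learned projections, and applies a `tanh` gate. The names `gated_local_attention`, `window`, and `gate` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_local_attention(
    video: List[Vector], audio: List[Vector], window: int, gate: float
) -> List[Vector]:
    """Each frame attends to audio steps within +/- `window` of its index.

    The attended audio context is scaled by tanh(gate) and added to the
    frame feature, so gate = 0 leaves the video features unchanged.
    """
    if len(video) != len(audio):
        raise ValueError("assumes one audio step per video frame")
    dim = len(video[0])
    scale = 1.0 / math.sqrt(dim)
    g = math.tanh(gate)
    out: List[Vector] = []
    for t, query in enumerate(video):
        lo, hi = max(0, t - window), min(len(audio), t + window + 1)
        local = audio[lo:hi]  # only nearby audio is visible to frame t
        weights = softmax([dot(query, key) * scale for key in local])
        context = [
            sum(w * key[d] for w, key in zip(weights, local)) for d in range(dim)
        ]
        out.append([q + g * c for q, c in zip(query, context)])
    return out
```

Restricting attention to a local window ties each frame to the sound occurring around it, which is one plausible way to encourage lip and motion synchronization. Starting the gate at zero lets a pretrained video model begin training with the audio branch switched off.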