UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes UniVidX, a unified multimodal video generation framework built on video diffusion models. It targets a limitation of existing approaches, which typically train a separate task-specific model for each problem setting and therefore struggle to model cross-modal relationships flexibly. UniVidX formulates pixel-aligned multimodal tasks as conditional generation within a shared multimodal space. It enables omni-directional conditional generation through Stochastic Condition Masking, preserves diffusion priors via Decoupled Gated LoRA modules, and improves modality alignment and interaction with Cross-Modal Self-Attention. Trained on fewer than 1,000 videos, UniVidX achieves performance competitive with state-of-the-art methods on tasks such as RGB-intrinsic decomposition and RGBA layered video generation, and generalizes robustly to in-the-wild scenarios.
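
The stochastic masking idea can be illustrated with a small, standard-library-only sketch. It is not the authors' code: the function names, the 0.5 split probability, the rule that at least one modality must remain a target, and the linear noise blend are all assumptions made for illustration.

```python
import random

def sample_condition_mask(modalities, rng):
    """Randomly mark each modality as a clean condition (True) or a noisy target (False)."""
    mask = {m: rng.random() < 0.5 for m in modalities}  # 0.5 is an assumed probability
    if all(mask.values()):
        # Assumption: something must be generated, so force one modality to be a target.
        mask[rng.choice(modalities)] = False
    return mask

def noise_targets(latents, mask, t, rng):
    """Keep condition latents clean; blend target latents with Gaussian noise at level t."""
    noisy = {}
    for name, x in latents.items():
        if mask[name]:
            noisy[name] = list(x)  # condition: passed through unchanged
        else:
            # Assumed linear interpolation between data and noise (t=1 is pure noise).
            noisy[name] = [(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in x]
    return noisy

rng = random.Random(0)
modalities = ["rgb", "albedo", "irradiance", "normal"]
latents = {m: [1.0, 2.0, 3.0] for m in modalities}  # toy stand-ins for video latents
mask = sample_condition_mask(modalities, rng)
noisy = noise_targets(latents, mask, t=0.7, rng=rng)
```

Because a fresh mask is drawn per training sample, a single model sees every condition-to-target direction rather than one fixed input-output mapping.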

📝 Abstract

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
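
As a rough illustration of the Cross-Modal Self-Attention description above, the sketch below lets each modality keep its own queries while attending over keys and values pooled from all modalities. It is a toy under stated assumptions: pooling by concatenation is one possible reading of "shares keys and values", the learned query, key, and value projections are replaced by identity maps, and every name here is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention for lists of token vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def cross_modal_self_attention(tokens):
    """tokens maps modality name -> list of token vectors (identity projections assumed)."""
    # Assumption: keys/values are shared by concatenating tokens from every modality.
    shared_keys = [tok for toks in tokens.values() for tok in toks]
    shared_values = shared_keys
    # Queries stay modality-specific: each modality queries the shared pool.
    return {name: attend(toks, shared_keys, shared_values)
            for name, toks in tokens.items()}

tokens = {"rgb": [[1.0, 0.0], [0.0, 1.0]], "alpha": [[1.0, 1.0]]}
mixed = cross_modal_self_attention(tokens)
```

Each output token is a weighted average over tokens from all modalities, which is the mechanism by which information can flow between, for example, the RGB and alpha streams.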

Problem

Research questions and friction points this paper is trying to address.

video diffusion models

multimodal generation

cross-modal consistency

conditional generation

modality correlation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic Condition Masking

Decoupled Gated LoRA

Cross-Modal Self-Attention

Unified Multimodal Framework

Video Diffusion Priors
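
Of the ideas listed above, Decoupled Gated LoRA lends itself to a compact sketch: a frozen backbone weight plus a per-modality low-rank update that is switched on only when that modality is a generation target. This is a minimal illustration, not the paper's implementation; the class name, the hard 0/1 gate, and the toy matrices are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

class GatedLoRALinear:
    """Frozen linear layer with one low-rank adapter per modality (hypothetical sketch)."""

    def __init__(self, weight, adapters):
        self.weight = weight      # frozen backbone weight, shape (out, in)
        self.adapters = adapters  # modality -> (A, B); A is (rank, in), B is (out, rank)

    def forward(self, x, modality, is_target):
        y = matvec(self.weight, x)  # backbone path, always active
        if is_target:
            # Gate open: add the low-rank update B @ (A @ x) for this modality only.
            a_mat, b_mat = self.adapters[modality]
            delta = matvec(b_mat, matvec(a_mat, x))
            y = [yi + di for yi, di in zip(y, delta)]
        return y

layer = GatedLoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    adapters={"albedo": ([[1.0, 1.0]], [[0.5], [0.0]])},  # rank-1 toy adapter
)
as_condition = layer.forward([1.0, 2.0], "albedo", is_target=False)
as_target = layer.forward([1.0, 2.0], "albedo", is_target=True)
```

When the modality is a clean condition, the layer reduces exactly to the frozen backbone, which is one way to read the claim that the design preserves the VDM's priors.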

🔎 Similar Papers

Pyramidal Flow Matching for Efficient Video Generative Modeling

2024-10-08 · arXiv.org · Citations: 31