OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing reinforcement learning approaches for joint audio-video generation suffer from inconsistent multi-objective advantages, imbalanced multimodal gradients, and uniform credit assignment, which limits fine-grained alignment. This work proposes OmniNFT, a modality-aware online diffusion reinforcement learning framework built on three components: modality-wise advantage routing, layer-wise gradient surgery that selectively detaches video-branch gradients on shallow audio layers, and region-wise loss reweighting focused on critical alignment regions. Evaluated on JavisBench and VBench using the LTX-2 backbone, the approach improves audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.

📝 Abstract

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Our in-depth analysis reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions are not explored efficiently. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
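
To make obstacle (i) and the idea of modality-wise advantage routing concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the reward names, the toy reward values, the group-normalization formula, and the `routing` table (including sending the synchronization reward to both branches) are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-8):
    """Standardize rewards within one sampling group (zero mean, unit std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def global_advantages(rewards):
    """Baseline: sum all rewards per sample, then normalize once per group."""
    totals = [sum(r.values()) for r in rewards]
    return group_normalize(totals)

def routed_advantages(rewards, routing):
    """Normalize each reward separately, then average the ones routed to each branch."""
    per_reward = {
        name: group_normalize([r[name] for r in rewards]) for name in rewards[0]
    }
    return {
        branch: [mean(per_reward[n][i] for n in names) for i in range(len(rewards))]
        for branch, names in routing.items()
    }

# Hypothetical rewards for a group of four samples from the same prompt.
rewards = [
    {"video_quality": 0.9, "audio_quality": 0.2, "sync": 0.5},
    {"video_quality": 0.3, "audio_quality": 0.8, "sync": 0.5},
    {"video_quality": 0.6, "audio_quality": 0.5, "sync": 0.9},
    {"video_quality": 0.2, "audio_quality": 0.3, "sync": 0.1},
]

# Hypothetical routing table: which rewards drive which generation branch.
routing = {
    "video": ["video_quality", "sync"],
    "audio": ["audio_quality", "sync"],
}

single = global_advantages(rewards)
routed = routed_advantages(rewards, routing)
print("global:", [round(a, 2) for a in single])
print("video :", [round(a, 2) for a in routed["video"]])
print("audio :", [round(a, 2) for a in routed["audio"]])
```

In this toy group, the first sample has strong video but weak audio. A single global advantage rates it above average and would reinforce both branches, while the routed version gives its audio branch a negative advantage and its video branch a positive one.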

Problem

Research questions and friction points this paper is trying to address.

joint audio-video generation

reinforcement learning

multi-modal alignment

fine-grained synchronization

modality fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

modality-wise advantage routing

layer-wise gradient surgery

region-wise loss reweighting

multi-modal reinforcement learning

audio-video synchronization
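
As a rough illustration of region-wise loss reweighting from the list above, the sketch below upweights the per-region loss where an alignment score is high. The function names, the linear weighting rule, the `strength` parameter, and the assumption that scores are non-negative are all hypothetical; the abstract does not specify how critical regions are scored or weighted.

```python
def region_weights(alignment_scores, strength=1.0):
    """Map non-negative alignment scores to loss weights with mean 1.

    Assumed rule: weight = 1 + strength * score, rescaled so the weights
    average to 1 and the overall loss scale stays comparable.
    """
    raw = [1.0 + strength * s for s in alignment_scores]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def reweighted_loss(per_region_loss, alignment_scores, strength=1.0):
    """Weighted mean of per-region losses, emphasizing high-score regions."""
    weights = region_weights(alignment_scores, strength)
    total = sum(w * l for w, l in zip(weights, per_region_loss))
    return total / len(per_region_loss)

# Toy example: four regions, only the first is marked as alignment-critical.
losses = [1.0, 0.0, 0.0, 0.0]
scores = [1.0, 0.0, 0.0, 0.0]

uniform = reweighted_loss(losses, scores, strength=0.0)   # plain mean
focused = reweighted_loss(losses, scores, strength=1.0)   # critical region upweighted
print(uniform, focused)
```

Setting `strength=0.0` recovers uniform credit assignment, so the same function covers both the baseline and the reweighted case.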

🔎 Similar Papers

MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

2024-05-28 · Citations: 3

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

2024-09-26 · arXiv.org · Citations: 4