OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

πŸ“… 2026-03-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the high latency of existing joint audio-visual diffusion models, which stems from bidirectional attention mechanisms and impedes real-time generation. To overcome this limitation, the authors propose a streaming autoregressive generation framework based on knowledge distillation, efficiently transferring an offline bidirectional teacher model to a low-latency student model. Key innovations include asymmetric block-causal alignment to account for inter-modal information density disparities, audio sink tokens to mitigate gradient explosion, and joint self-forcing distillation to alleviate exposure bias. The approach further integrates causal distillation, zero-truncated global prefixing, Identity RoPE constraints, and a modality-agnostic rolling KV cache for efficient inference. Evaluated on a single GPU, the method achieves approximately 25 FPS for real-time synchronized audio-visual generation while preserving multimodal alignment and visual fidelity comparable to the teacher model.
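The summary's "modality-agnostic rolling KV cache" can be pictured as a single fixed-size window of cached key/value blocks shared by both streams, evicting the oldest block regardless of modality. The sketch below is purely illustrative (the class name, block granularity, and eviction policy are assumptions, not the paper's actual implementation):

```python
from collections import deque

class RollingKVCache:
    """Hypothetical sketch of a modality-agnostic rolling KV cache:
    key/value blocks from audio and video share one fixed-size window,
    and the oldest block is evicted once the budget is exceeded,
    regardless of which modality produced it."""

    def __init__(self, max_blocks: int):
        self.max_blocks = max_blocks
        self.blocks = deque()  # each entry: (modality, key_block, value_block)

    def append(self, modality: str, keys, values):
        self.blocks.append((modality, keys, values))
        while len(self.blocks) > self.max_blocks:
            self.blocks.popleft()  # drop the temporally oldest block

    def gather(self):
        # Concatenate cached keys/values in temporal order for attention.
        ks = [k for _, k, _ in self.blocks]
        vs = [v for _, _, v in self.blocks]
        return ks, vs

cache = RollingKVCache(max_blocks=3)
cache.append("video", "Kv0", "Vv0")
cache.append("audio", "Ka0", "Va0")
cache.append("video", "Kv1", "Vv1")
cache.append("video", "Kv2", "Vv2")  # evicts the oldest block (Kv0)
print(cache.gather()[0])  # ['Ka0', 'Kv1', 'Kv2']
```

Sharing one window across modalities is what keeps memory bounded during long streaming rollouts, independent of how many tokens each modality emits per step.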

πŸ“ Abstract
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project Page: https://omniforcing.com
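The abstract's "Asymmetric Block-Causal Alignment" implies an attention mask in which every token may attend within its own temporal block and to all earlier blocks, while video and audio contribute very different numbers of tokens per block (the information-density asymmetry). A minimal sketch, assuming a simple per-token block index (the function name and interleaving scheme are illustrative, not the paper's actual formulation):

```python
def block_causal_mask(block_ids):
    """Hypothetical sketch of a block-causal attention mask over an
    interleaved audio-visual token sequence. block_ids[i] is the
    temporal block index of token i; video blocks may hold many more
    tokens than audio blocks. mask[q][k] is True when query token q
    may attend to key token k, i.e. k's block is not later than q's."""
    n = len(block_ids)
    return [[block_ids[k] <= block_ids[q] for k in range(n)]
            for q in range(n)]

# Block 0: 4 video tokens + 1 audio token; block 1: the same layout.
ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
mask = block_causal_mask(ids)
print(mask[2])  # block-0 query: sees only block 0
print(mask[7])  # block-1 query: sees blocks 0 and 1
```

Because causality is enforced at block rather than token granularity, tokens within a block still attend bidirectionally to each other, which is what lets a distilled student retain much of the bidirectional teacher's per-block quality.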
Problem

Research questions and friction points this paper is trying to address.

real-time generation
audio-visual synchronization
diffusion models
latency
streaming generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming audio-visual generation
causal distillation
asymmetric block-causal alignment
audio sink token
self-forcing distillation
πŸ”Ž Similar Papers
No similar papers found.