Latent Swap Joint Diffusion for Long-Form Audio Generation

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing global-view or multi-view joint diffusion methods for long-form audio generation suffer from severe spectral overlap distortion, costly cross-view consistency modeling, and slow inference. To address these issues, the paper proposes SaFa (Swap Forward), a frame-level latent swap framework that synchronizes multiple diffusions to produce globally coherent long audio in a forward-only manner. SaFa replaces conventional averaging fusion with two swap operations: a bidirectional Self-Loop Latent Swap between adjacent views, which adaptively preserves high-frequency spectral detail along the stepwise diffusion trajectory, and a unidirectional Reference-Guided Latent Swap between the reference and each subview, which provides centralized guidance for global cross-view consistency. Experiments demonstrate that SaFa achieves state-of-the-art performance on both long audio and panoramic audio generation, with over 40% faster inference and strong generalization.

📝 Abstract
Previous work on long-form audio generation using global-view diffusion or iterative generation demands significant training or inference costs. While recent advancements in multi-view joint diffusion for panoramic generation provide an efficient option, they struggle with spectrum generation with severe overlap distortions and high cross-view consistency costs. We initially explore this phenomenon through the connectivity inheritance of latent maps and uncover that averaging operations excessively smooth the high-frequency components of the latent map. To address these issues, we propose Swap Forward (SaFa), a frame-level latent swap framework that synchronizes multiple diffusions to produce a globally coherent long audio with more spectrum details in a forward-only manner. At its core, the bidirectional Self-Loop Latent Swap is applied between adjacent views, leveraging stepwise diffusion trajectory to adaptively enhance high-frequency components without disrupting low-frequency components. Furthermore, to ensure cross-view consistency, the unidirectional Reference-Guided Latent Swap is applied between the reference and the non-overlap regions of each subview during the early stages, providing centralized trajectory guidance. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based long audio generation models. Moreover, we find that it also adapts well to panoramic generation, achieving comparable state-of-the-art performance with greater efficiency and model generalizability. Project page is available at https://swapforward.github.io/.
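The abstract's core observation is that averaging the overlapping latents of adjacent views excessively smooths high-frequency components, whereas swapping frame-level latents between views preserves them. The toy NumPy sketch below illustrates that contrast on random latents; it is a hypothetical illustration of the general idea (alternating-frame exchange in an overlap region), not the authors' implementation, and the function names and swap pattern are assumptions.

```python
import numpy as np

def averaged_fusion(a, b):
    """Conventional fusion: average the overlapping latents of two views.
    Averaging attenuates high-frequency components where the views differ."""
    merged = (a + b) / 2.0
    return merged, merged

def frame_level_swap(a, b):
    """Illustrative frame-level latent swap (hypothetical sketch): exchange
    alternating frames of the overlap between adjacent views instead of
    averaging, so every frame keeps a full-amplitude latent drawn from one
    of the two diffusion trajectories."""
    a_new, b_new = a.copy(), b.copy()
    a_new[1::2], b_new[1::2] = b[1::2], a[1::2]  # swap odd-indexed frames
    return a_new, b_new

# Toy overlap latents with shape (frames, channels).
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 4))
b = rng.standard_normal((8, 4))

avg_a, avg_b = averaged_fusion(a, b)
swap_a, swap_b = frame_level_swap(a, b)

# Averaging independent latents shrinks their variance (spectral smoothing),
# while swapping only reshuffles which view owns each frame, so the set of
# latent values, and hence their variance, is preserved.
```

After the swap, `swap_a` holds `a`'s even-indexed frames and `b`'s odd-indexed frames, so no frame is an attenuated mixture; under averaging, every frame of `avg_a` is.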
Problem

Research questions and friction points this paper is trying to address.

Reduces training and inference costs
Improves spectrum generation quality
Ensures cross-view consistency efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Swap Forward frame-level framework
Self-Loop Latent Swap technique
Reference-Guided Latent Swap method
Yusheng Dai
Monash University
Multimodal, Speech Processing, Computer Vision
Chenxi Wang
University of Science and Technology of China
Chang Li
University of Science and Technology of China
Chen Wang
Tsinghua University
Jun Du
University of Science and Technology of China
Kewei Li
University of Science and Technology of China
Ruoyu Wang
University of Science and Technology of China
Jiefeng Ma
USTC
NLP, Language Modelling, Document Intelligence
Lei Sun
iFlytek Research
Jianqing Gao
iFlytek Research