FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual spatiotemporal alignment in cinematic Foley production is labor-intensive and time-consuming. FoleyDesigner addresses this by emulating the professional Foley workflow: it integrates video analysis, controllable audio generation, and expert-level mixing to synthesize immersive stereophonic sound with high-precision spatiotemporal alignment. The framework pairs a multi-agent architecture with an LLM-driven hybrid mechanism and introduces FilmStereo, the first professional stereophonic Foley dataset annotated with spatial metadata and semantic labels, compatible with industry standards such as Dolby Atmos. Experiments show that FoleyDesigner significantly outperforms existing methods in alignment accuracy, integrates seamlessly into professional film production pipelines, and supports interactive control alongside 5.1-channel Dolby Atmos output.
📝 Abstract
Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in the film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, and is fully compatible with professional film production standards. The project page is available at https://gekiii996.github.io/FoleyDesigner/ .
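The abstract mentions compatibility with 5.1-channel systems compliant with ITU-R BS.775. As a minimal illustration of what that recommendation specifies (not the paper's implementation), the sketch below applies the BS.775 two-channel downmix: the centre and surround channels are attenuated by about 3 dB (a factor of 1/√2) before being folded into the left/right pair, and the LFE channel is conventionally dropped. The function name and the per-channel-list interface are illustrative assumptions.

```python
import math

# ITU-R BS.775 downmix gain for centre and surround channels:
# -3 dB, i.e. a factor of 1/sqrt(2) ≈ 0.7071.
BS775_GAIN = 1.0 / math.sqrt(2.0)

def downmix_51_to_stereo(l, r, c, lfe, ls, rs):
    """Fold per-sample 5.1 channel lists down to a stereo pair.

    Per common BS.775 practice the LFE channel is omitted from the
    two-channel downmix, so `lfe` is accepted but unused here.
    """
    lo = [li + BS775_GAIN * ci + BS775_GAIN * si
          for li, ci, si in zip(l, c, ls)]
    ro = [ri + BS775_GAIN * ci + BS775_GAIN * si
          for ri, ci, si in zip(r, c, rs)]
    return lo, ro
```

In practice a renderer would also apply headroom management to avoid clipping after summation; the coefficients above only capture the channel-weighting part of the recommendation.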
Problem

Research questions and friction points this paper is trying to address.

Foley generation
spatio-temporal alignment
immersive audio
stereo sound
film post-production
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatio-temporal alignment
latent diffusion models
multi-agent architecture
Foley generation
stereo audio dataset
Mengtian Li
Shanghai Film Academy, Shanghai University
Kunyan Dai
Shanghai Film Academy, Shanghai University
Yi Ding
Shanghai Film Academy, Shanghai University
Ruobing Ni
Shanghai Film Academy, Shanghai University
Ying Zhang
Shanghai Film Academy, Shanghai University
Wenwu Wang
Professor, University of Surrey, UK
signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion
Zhifeng Xie
Shanghai Film Academy, Shanghai University