🤖 AI Summary
This work addresses the labor-intensive manual spatio-temporal alignment required in cinematic Foley production. It introduces FoleyDesigner, a framework that emulates professional Foley workflows by combining film clip analysis, controllable audio generation, and professional mixing to synthesize stereo sound that is spatio-temporally aligned with the picture. FoleyDesigner uses a multi-agent architecture for analysis, together with latent diffusion models and an LLM-driven hybrid mechanism for alignment. It also introduces FilmStereo, described as the first professional stereo Foley dataset, annotated with spatial metadata, precise timestamps, and semantic labels for eight common Foley categories. Experiments show better spatio-temporal alignment than existing baselines. The framework supports interactive control and integrates with professional film production pipelines, including 5.1-channel Dolby Atmos output.
📝 Abstract
Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows that integrates film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in the film industry. To address the lack of high-quality stereo audio datasets for film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. In application, the framework supports interactive user control while integrating seamlessly with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with the ITU-R BS.775 standard, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines while remaining compatible with professional film production standards. The project page is available at https://gekiii996.github.io/FoleyDesigner/.