Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio (V2A) methods focus primarily on Foley sound generation and struggle to synthesize intelligible speech; conversely, environment-aware speech synthesis relies on textual input and lacks temporal alignment with video. This paper introduces the first end-to-end video-to-environment-aware speech synthesis framework, enabling precise temporal synchronization between spoken content and dynamic visual scenes. Our method employs a two-stage modeling pipeline: (1) video-guided audio semantic prediction (V2AS), which generates unified audio semantic tokens, and (2) video-conditioned semantic-to-acoustic conversion (VS2A), which produces high-fidelity acoustic tokens while jointly modeling speech and background sound effects. Extensive experiments demonstrate significant improvements over state-of-the-art methods on both contextual speech synthesis and immersive audio background transfer tasks. Ablation studies validate the effectiveness of each component. The code and interactive demo system are publicly released.

📝 Abstract
The generation of realistic, context-aware audio is important in real-world applications such as video game development. While existing video-to-audio (V2A) methods mainly focus on Foley sound generation, they struggle to produce intelligible speech. Meanwhile, current environmental speech synthesis approaches remain text-driven and fail to temporally align with dynamic video content. In this paper, we propose Beyond Video-to-SFX (BVS), a method to generate synchronized audio with environmentally aware intelligible speech for given videos. We introduce a two-stage modeling method: (1) stage one is a video-guided audio semantic (V2AS) model to predict unified audio semantic tokens conditioned on phonetic cues; (2) stage two is a video-conditioned semantic-to-acoustic (VS2A) model that refines semantic tokens into detailed acoustic tokens. Experiments demonstrate the effectiveness of BVS in scenarios such as video-to-context-aware speech synthesis and immersive audio background conversion, with ablation studies further validating our design. Our demonstration is available at BVS-Demo: https://xinleiniu.github.io/BVS-demo/
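To make the two-stage design concrete, the sketch below wires a stage-one V2AS model (video features plus phonetic cues in, unified audio semantic tokens out) into a stage-two VS2A model (semantic tokens refined into acoustic codec tokens under video conditioning). This is a minimal illustration, not the authors' implementation: all module architectures, dimensions, and vocabulary sizes are assumptions, and the neural codec decoder that would turn acoustic tokens into a waveform is omitted.

```python
# Hypothetical sketch of the BVS two-stage pipeline; every architectural
# choice here (transformer sizes, vocabularies, fusion strategy) is an
# illustrative assumption, not the paper's actual model.
import torch
import torch.nn as nn

class V2AS(nn.Module):
    """Stage 1 (sketch): predict unified audio semantic tokens from
    video features, conditioned on phonetic cues."""
    def __init__(self, video_dim=512, phone_vocab=100, sem_vocab=1024, d=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d)
        self.phone_emb = nn.Embedding(phone_vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, sem_vocab)

    def forward(self, video_feats, phonemes):
        # Concatenate video and phoneme streams along time, then fuse.
        x = torch.cat([self.video_proj(video_feats), self.phone_emb(phonemes)], dim=1)
        logits = self.head(self.fuse(x))
        return logits.argmax(dim=-1)           # (B, T) semantic token ids

class VS2A(nn.Module):
    """Stage 2 (sketch): refine semantic tokens into acoustic codec
    tokens, with video conditioning so speech and background SFX are
    modeled jointly."""
    def __init__(self, sem_vocab=1024, video_dim=512, acoustic_vocab=2048, d=256):
        super().__init__()
        self.sem_emb = nn.Embedding(sem_vocab, d)
        self.video_proj = nn.Linear(video_dim, d)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d, acoustic_vocab)

    def forward(self, sem_tokens, video_feats):
        # Cross-attend semantic tokens over the video conditioning.
        h = self.dec(self.sem_emb(sem_tokens), self.video_proj(video_feats))
        return self.head(h).argmax(dim=-1)     # (B, T) acoustic token ids

# Toy end-to-end pass: 16 video frames, 12 phonemes, batch of 1.
video = torch.randn(1, 16, 512)
phones = torch.randint(0, 100, (1, 12))
sem = V2AS()(video, phones)                    # stage 1: semantic tokens
acoustic = VS2A()(sem, video)                  # stage 2: acoustic tokens
# A neural codec decoder would map `acoustic` back to a waveform.
```

The point of the split mirrors token-based speech pipelines: stage one commits to *what* should sound (semantic content aligned to video and phonemes), while stage two decides *how* it sounds (acoustic detail, including the environmental background).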
Problem

Research questions and friction points this paper is trying to address.

Generate environmentally aware speech synchronized with video
Overcome the temporal misalignment of text-driven environmental speech synthesis with video
Produce intelligible speech beyond traditional Foley sound generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage video-to-audio synthesis method
Video-guided semantic token prediction
Video-conditioned acoustic token refinement
Xinlei Niu
Australian National University

Jianbo Ma
Dolby Laboratories

Dylan Harper-Harris
Dolby Laboratories

Xiangyu Zhang
The University of New South Wales

Charles Patrick Martin
The Australian National University
computer music · new interfaces for musical expression · NIME · HCI · human-computer interaction

Jing Zhang
Australian National University