Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio (V2A) methods focus primarily on Foley sound generation and struggle to synthesize intelligible speech; conversely, environment-aware speech synthesis relies on textual input and lacks temporal alignment with video. This paper introduces the first end-to-end video-to-environment-aware speech synthesis framework, enabling precise temporal synchronization between spoken content and dynamic visual scenes. Our method employs a two-stage modeling pipeline: (1) video-guided audio semantic prediction (V2AS), which generates unified audio semantic tokens, and (2) video-conditioned semantic-to-acoustic conversion (VS2A), which produces high-fidelity acoustic tokens while jointly modeling speech and background sound effects. Extensive experiments demonstrate significant improvements over state-of-the-art methods on both contextual speech synthesis and immersive audio background transfer tasks. Ablation studies validate the effectiveness of each component. The code and interactive demo system are publicly released.

📝 Abstract
The generation of realistic, context-aware audio is important in real-world applications such as video game development. While existing video-to-audio (V2A) methods mainly focus on Foley sound generation, they struggle to produce intelligible speech. Meanwhile, current environmental speech synthesis approaches remain text-driven and fail to temporally align with dynamic video content. In this paper, we propose Beyond Video-to-SFX (BVS), a method to generate synchronized audio with environmentally aware intelligible speech for given videos. We introduce a two-stage modeling method: (1) stage one is a video-guided audio semantic (V2AS) model to predict unified audio semantic tokens conditioned on phonetic cues; (2) stage two is a video-conditioned semantic-to-acoustic (VS2A) model that refines semantic tokens into detailed acoustic tokens. Experiments demonstrate the effectiveness of BVS in scenarios such as video-to-context-aware speech synthesis and immersive audio background conversion, with ablation studies further validating our design. Our demonstration is available at BVS-Demo: https://xinleiniu.github.io/BVS-demo/
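To make the two-stage design concrete, the sketch below wires a stage-one V2AS model (video features plus phonetic cues in, unified audio semantic tokens out) into a stage-two VS2A model (semantic tokens refined into acoustic codec tokens under video conditioning). This is a minimal illustration, not the authors' implementation: all module architectures, dimensions, and vocabulary sizes are assumptions, and the neural codec decoder that would turn acoustic tokens into a waveform is omitted.

```python
# Hypothetical sketch of the BVS two-stage pipeline; every architectural
# choice here (transformer sizes, vocabularies, fusion strategy) is an
# illustrative assumption, not the paper's actual model.
import torch
import torch.nn as nn

class V2AS(nn.Module):
    """Stage 1 (sketch): predict unified audio semantic tokens from
    video features, conditioned on phonetic cues."""
    def __init__(self, video_dim=512, phone_vocab=100, sem_vocab=1024, d=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d)
        self.phone_emb = nn.Embedding(phone_vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, sem_vocab)

    def forward(self, video_feats, phonemes):
        # Concatenate video and phoneme streams along time, then fuse.
        x = torch.cat([self.video_proj(video_feats), self.phone_emb(phonemes)], dim=1)
        logits = self.head(self.fuse(x))
        return logits.argmax(dim=-1)           # (B, T) semantic token ids

class VS2A(nn.Module):
    """Stage 2 (sketch): refine semantic tokens into acoustic codec
    tokens, with video conditioning so speech and background SFX are
    modeled jointly."""
    def __init__(self, sem_vocab=1024, video_dim=512, acoustic_vocab=2048, d=256):
        super().__init__()
        self.sem_emb = nn.Embedding(sem_vocab, d)
        self.video_proj = nn.Linear(video_dim, d)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d, acoustic_vocab)

    def forward(self, sem_tokens, video_feats):
        # Cross-attend semantic tokens over the video conditioning.
        h = self.dec(self.sem_emb(sem_tokens), self.video_proj(video_feats))
        return self.head(h).argmax(dim=-1)     # (B, T) acoustic token ids

# Toy end-to-end pass: 16 video frames, 12 phonemes, batch of 1.
video = torch.randn(1, 16, 512)
phones = torch.randint(0, 100, (1, 12))
sem = V2AS()(video, phones)                    # stage 1: semantic tokens
acoustic = VS2A()(sem, video)                  # stage 2: acoustic tokens
# A neural codec decoder would map `acoustic` back to a waveform.
```

The point of the split mirrors token-based speech pipelines: stage one commits to *what* should sound (semantic content aligned to video and phonemes), while stage two decides *how* it sounds (acoustic detail, including the environmental background).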
Problem

Research questions and friction points this paper is trying to address.

Generate environmentally aware speech synchronized with video
Overcome the temporal misalignment of text-driven environmental speech synthesis with video
Produce intelligible speech beyond traditional Foley sound generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage video-to-audio synthesis method
Video-guided semantic token prediction
Video-conditioned acoustic token refinement
Xinlei Niu
Australian National University

Jianbo Ma
Dolby Laboratories

Dylan Harper-Harris
Dolby Laboratories

Xiangyu Zhang
The University of New South Wales

Charles Patrick Martin
The Australian National University
computer music · new interfaces for musical expression · NIME · HCI · human-computer interaction

Jing Zhang
Australian National University