FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio (V2A) generation methods struggle to achieve fine-grained temporal control in multi-event scenes or when visual cues are insufficient. To address this, this work proposes FoleyDirector, a framework that, for the first time, introduces Structured Temporal Scripts (STS) as segment-wise temporal guidance within a DiT-based V2A model. The scripts are integrated through a Script-Guided Temporal Fusion Module built on a Temporal Script Attention mechanism, and combined with a Bi-Frame Sound Synthesis strategy that generates in-frame and out-of-frame sounds in parallel while allowing seamless switching between standard V2A generation and temporally controlled synthesis. Evaluated on the newly curated DirectorSound dataset and the VGGSoundDirector and DirectorBench benchmarks, the method shows substantial improvements in both temporal controllability and audio fidelity, achieving director-level precision in Foley sound manipulation.
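
The paper does not publish its data format here, but a Structured Temporal Script can be pictured as a list of short, time-stamped captions covering the clip, with each latent audio frame mapped to the segment that covers it. The sketch below is a hypothetical illustration of that idea; all class names, fields, and the frame-mapping rule are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a Structured Temporal Script (STS): a list of short,
# time-stamped captions for one clip, plus a helper that maps each latent
# audio frame to the segment covering its timestamp.
from dataclasses import dataclass
from typing import List


@dataclass
class ScriptSegment:
    start_s: float   # segment start time in seconds
    end_s: float     # segment end time in seconds
    caption: str     # short sound description for this segment


@dataclass
class StructuredTemporalScript:
    segments: List[ScriptSegment]

    def segment_index_per_frame(self, num_frames: int, clip_len_s: float) -> List[int]:
        """Return, for each latent frame, the index of the script segment covering it."""
        indices = []
        for f in range(num_frames):
            t = (f + 0.5) / num_frames * clip_len_s  # frame-center timestamp
            chosen = len(self.segments) - 1          # fall back to the last segment
            for i, seg in enumerate(self.segments):
                if seg.start_s <= t < seg.end_s:
                    chosen = i
                    break
            indices.append(chosen)
        return indices


# Example: a 10 s clip described by two events
sts = StructuredTemporalScript([
    ScriptSegment(0.0, 4.0, "a dog barks twice"),
    ScriptSegment(4.0, 10.0, "footsteps on gravel approach"),
])
print(sts.segment_index_per_frame(num_frames=20, clip_len_s=10.0))
```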

📝 Abstract
Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
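
To make the fusion idea concrete, here is a minimal PyTorch sketch of how a Temporal Script Attention layer might inject segment-wise script features into DiT audio latents: each latent frame cross-attends only to the caption tokens of the script segment that covers its time span, with a residual connection that leaves the base V2A path intact. Module and variable names, the masking scheme, and the residual fusion are assumptions for illustration, not the authors' code.

```python
# Minimal sketch (assumed design) of segment-masked cross-attention between
# DiT audio latents and script-caption tokens.
import torch
import torch.nn as nn


class TemporalScriptAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents, script_tokens, frame_to_segment, token_to_segment):
        # latents:          (B, F, D)  audio latent frames from the DiT
        # script_tokens:    (B, T, D)  text features of all segment captions, concatenated
        # frame_to_segment: (B, F)     segment index covering each latent frame
        # token_to_segment: (B, T)     segment index each caption token belongs to
        # Mask out attention from a frame to tokens of other segments
        # (True = blocked, per nn.MultiheadAttention's boolean attn_mask).
        mask = frame_to_segment.unsqueeze(-1) != token_to_segment.unsqueeze(1)  # (B, F, T)
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)               # (B*heads, F, T)
        fused, _ = self.attn(self.norm(latents), script_tokens, script_tokens,
                             attn_mask=mask)
        return latents + fused  # residual fusion keeps the base V2A behavior


# Toy usage: 16 latent frames, 24 caption tokens, two segments
B, F, T, D = 1, 16, 24, 64
frame_seg = torch.arange(F).unsqueeze(0) // (F // 2)  # first half -> segment 0, second half -> 1
token_seg = torch.arange(T).unsqueeze(0) // (T // 2)
layer = TemporalScriptAttention(D)
out = layer(torch.randn(B, F, D), torch.randn(B, T, D), frame_seg, token_seg)
print(out.shape)  # torch.Size([1, 16, 64])
```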
Problem

Research questions and friction points this paper is trying to address.

Video-to-Audio Generation
Temporal Control
Fine-Grained Steering
Multi-Event Scenarios
Insufficient Visual Cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Temporal Scripts
Temporal Script Attention
Bi-Frame Sound Synthesis
Fine-Grained Temporal Control
Video-to-Audio Generation
You Li
The State Key Lab of Brain-Machine Intelligence, Zhejiang University
Dewei Zhou
The State Key Lab of Brain-Machine Intelligence, Zhejiang University
Fan Ma
The State Key Lab of Brain-Machine Intelligence, Zhejiang University
Fu Li
Intelligent Creation, ByteDance, China
Dongliang He
ByteDance Inc.
Computer Vision, Deep Learning, Multimedia
Yi Yang
Zhejiang University
multimedia, computer vision, machine learning