🤖 AI Summary
Existing text-assisted video-to-audio generation methods rely heavily on audio descriptions at inference time, yet such descriptions are often unavailable for silent videos. This limits multimodal large language models' ability to infer plausible audio content solely from visual cues.
Method: We propose Reasoning Audio Descriptions from Silent Videos (SVAD), a novel task requiring accurate audio description generation from silent videos via visual-only reasoning. To address it, we (1) formally define the SVAD task; (2) introduce CoT-AudioCaps, the first dataset supporting chain-of-thought (CoT)-guided audio description generation; and (3) design a CoT-based supervised fine-tuning strategy to enhance vision-language models' cross-modal sound semantic modeling.
Results: Experiments demonstrate significant improvements in both the accuracy and diversity of generated audio descriptions. Moreover, SVAD-trained models yield more reliable and robust audio prompts for downstream text-assisted video-to-audio (VT2A) generation, advancing audio-aware multimodal understanding without audio supervision.
📝 Abstract
Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without access to the target modality remains relatively unexplored. Current text-assisted video-to-audio (VT2A) methods excel at video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance VLMs' reasoning capacity for the SVAD task, we construct the CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the challenge of acquiring audio descriptions during VT2A inference.