Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-assisted video-to-audio (VT2A) generation methods rely heavily on audio supervision and struggle when no audio description is available at inference, which limits multimodal large models' ability to infer plausible audio content from visual cues alone. Method: We introduce Reasoning Audio Descriptions from Silent Videos (SVAD), a novel task requiring accurate audio description generation from silent videos via visual-only reasoning. To address it, we (1) formally define the SVAD task; (2) construct CoT-AudioCaps, the first dataset supporting chain-of-thought (CoT)-guided audio description generation; and (3) design a CoT-based supervised fine-tuning strategy that strengthens vision-language models' cross-modal sound semantic modeling. Results: Experiments demonstrate significant improvements in both the accuracy and diversity of generated audio descriptions. Moreover, SVAD-trained models yield more reliable and robust audio prompts for downstream VT2A generation, advancing audio-aware multimodal understanding without audio supervision.

📝 Abstract
Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance the VLMs' reasoning capacity for the SVAD task, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the challenge of acquiring audio descriptions during VT2A inference.
Problem

Research questions and friction points this paper is trying to address.

Whether multimodal large models can infer plausible sounds from silent videos remains underexplored
Current VT2A methods cannot easily obtain audio descriptions at inference time
Vision-language models need stronger modal-mismatch reasoning to bridge this gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates SVAD, a silent-video audio-description reasoning task for vision-language models
Proposes a Chain-of-Thought-based supervised fine-tuning strategy
Constructs the CoT-AudioCaps dataset to support CoT-guided audio description reasoning