ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio generation methods struggle to simultaneously achieve high-fidelity audio synthesis and precise visual-auditory alignment, particularly when modeling dynamic visual cues, acoustic environments, and temporal relationships. To address this, we propose ThinkSound, the first framework to introduce a three-stage chain-of-thought (CoT) reasoning paradigm into multimodal audio generation: (i) foundational Foley synthesis, (ii) object-centric interactive refinement, and (iii) natural language-guided editing, all orchestrated by a multimodal large language model working jointly with an audio foundation model. Our key contributions are: (1) AudioCoT, the first structured reasoning dataset explicitly designed for audio-visual alignment; (2) a semantic-preserving, user-controllable interactive audio editing interface; and (3) state-of-the-art performance on fidelity metrics (e.g., FAD, KL divergence) and CoT reasoning capability, with significant gains on the out-of-distribution Movie Gen Audio benchmark.

📝 Abstract
While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. Like professionals in the creative industries, such models must reason in sophisticated ways about visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics, and excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Demo.github.io.
Problem

Research questions and friction points this paper is trying to address.

Producing high-fidelity audio from visual content nuances
Enabling stepwise, interactive audio generation and editing
Linking visual content, text, and sound via structured reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought reasoning for audio generation
Three-stage interactive audio refinement
Multimodal LLM-guided audio foundation model
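The three-stage decomposition above can be sketched as a simple pipeline: each stage appends to a CoT reasoning trace that would condition the audio foundation model. This is a minimal illustrative sketch; all function names and data structures here are hypothetical and do not reflect the authors' actual implementation or API.

```python
# Hypothetical sketch of ThinkSound's three-stage CoT pipeline.
# Stage names follow the paper; everything else is illustrative.
from dataclasses import dataclass, field


@dataclass
class AudioDraft:
    """Working state: a CoT reasoning trace plus planned sound events."""
    reasoning: str
    events: list = field(default_factory=list)


def foley_generation(video_caption: str) -> AudioDraft:
    """Stage 1: the MLLM reasons over the video and drafts a coherent soundscape."""
    cot = f"Scene: {video_caption}. Plan an ambient bed plus salient sound events."
    return AudioDraft(reasoning=cot, events=["ambient"])


def object_refinement(draft: AudioDraft, selected_object: str) -> AudioDraft:
    """Stage 2: the user selects an on-screen object; its sound is refined."""
    draft.events.append(selected_object)
    draft.reasoning += f" Refine: emphasize the sound of '{selected_object}'."
    return draft


def language_edit(draft: AudioDraft, instruction: str) -> AudioDraft:
    """Stage 3: a natural-language instruction edits the audio plan."""
    draft.reasoning += f" Edit: {instruction}."
    return draft


# Each stage's output conditions the next; the final trace would be
# handed to the audio foundation model for synthesis.
draft = foley_generation("a dog runs along a rainy street")
draft = object_refinement(draft, "dog footsteps")
draft = language_edit(draft, "make the rain softer")
print(draft.events)  # ['ambient', 'dog footsteps']
```

The key design point reflected here is that reasoning is accumulated stepwise rather than regenerated from scratch, so user interactions at stages 2 and 3 preserve the semantics established in stage 1.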
👥 Authors
Huadai Liu (Tongyi Lab, Alibaba Group)
Jialei Wang (University of Chicago; Machine Learning, Statistics, Optimization)
Kaicheng Luo (Zhejiang University)
Wen Wang (Tongyi Lab, Alibaba Group)
Qian Chen (Tongyi Lab, Alibaba Group)
Zhou Zhao (Zhejiang University; Machine Learning, Data Mining, Multimedia Computing)
Wei Xue (Hong Kong University of Science and Technology (HKUST))