Chain-of-Description: What I can understand, I can put into words

📅 2025-02-22
🤖 AI Summary
Multimodal large language models (MLLMs) often suffer from insufficient cross-modal semantic alignment and leave their reasoning process implicit. To address this, the paper proposes Chain-of-Description (CoD), a prompting strategy that explicitly decouples understanding from generation. The model is first prompted to produce a fine-grained, structured textual description of the multimodal input; subsequent reasoning and answer generation are then conditioned solely on this explicit intermediate representation. CoD is the first approach to formalize such an explicit descriptive step as a core component of the reasoning chain. The method employs a two-stage prompting paradigm augmented with structured instruction tuning and is validated on Qwen2-Audio, Qwen2-VL, and Qwen2.5-VL. Experiments show consistent improvements: +3.9% on the speech tasks of AIR-Bench-Chat and +5.3% on the hard subset of MMMU_Pro. Ablation studies confirm the efficacy of both the explicit description phase and the structural decoupling design.

📝 Abstract
In this paper, we propose a novel strategy defined as Chain-of-Description (CoD) Prompting, tailored for Multi-Modal Large Language Models. This approach involves having the model first provide a detailed description of the multi-modal input before generating an answer to the question. When applied to models such as Qwen2-Audio, Qwen2-VL, and Qwen2.5-VL, CoD Prompting significantly enhances performance compared to standard prompting methods. This is demonstrated by nearly a 4% improvement in the speech category of the audio benchmark AIR-Bench-Chat and a 5.3% improvement in the hard-level portion of the vision benchmark MMMU_Pro. Our ablation study further validates the effectiveness of CoD Prompting.
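The describe-then-answer flow in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: the `Model` callable, the prompt wording, and the choice to issue two separate calls (and to pass the input again in the second call) are all assumptions made for the example. The abstract only states that the model describes the input in detail before answering.

```python
from typing import Any, Callable, Tuple

# Hypothetical interface (an assumption for this sketch): a callable that takes
# the raw multimodal input plus a text prompt and returns generated text.
Model = Callable[[Any, str], str]

# Assumed prompt wording; the source does not give the exact prompts.
DESCRIBE_PROMPT = "Describe the input in detail."


def standard_prompting(model: Model, media: Any, question: str) -> str:
    """Baseline: ask the question directly."""
    return model(media, question)


def chain_of_description(model: Model, media: Any, question: str) -> Tuple[str, str]:
    """Describe the input first, then answer with that description in context."""
    # Step 1: elicit a detailed description of the multimodal input.
    description = model(media, DESCRIBE_PROMPT)

    # Step 2: answer the question with the description placed in the prompt.
    # Passing `media` again here is an assumption of this sketch.
    answer_prompt = (
        f"Description of the input:\n{description}\n\n"
        f"Question: {question}\n"
        "Answer the question using the description above."
    )
    answer = model(media, answer_prompt)
    return description, answer
```

Returning the description alongside the answer makes the intermediate step inspectable, which is useful when comparing against `standard_prompting` on the same input.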
Problem

Research questions and friction points this paper is trying to address.

Improves the performance of Multi-Modal Large Language Models over standard prompting.
Improves results on the speech category of AIR-Bench-Chat and the hard-level portion of MMMU_Pro.
Introduces the Chain-of-Description Prompting strategy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Description Prompting
Multi-Modal Large Language Models
Detailed input description before answering
Jiaxin Guo
Huawei Translation Services Center, Beijing, China
Daimeng Wei
Huawei Translation Services Center, Beijing, China
Zongyao Li
Huawei Translation Services Center, Beijing, China
Hengchao Shang
Huawei Translation Services Center, Beijing, China
Yuanchang Luo
2012@Huawei
Hao Yang
Huawei Translation Services Center, Beijing, China