π€ AI Summary
Existing multi-modal chain-of-thought (MCoT) methods rely heavily on large-scale annotated reasoning chains and emphasize inter-object relational reasoning, neglecting intra-object fine-grained understanding essential for image classification. This work proposes WISE, the first framework to transform the structured concept representations of concept bottleneck models (CBMs) into interpretable, semantic-concept-driven multimodal chains of thought under weak supervision. WISE enables end-to-end reasoning enhancement without requiring manually annotated reasoning tracesβonly image-class labels are needed to generate stepwise, concept-based explanations. By integrating CBMs, weakly supervised learning, and MCoT generation, WISE achieves consistent improvements across 10 benchmark datasets: it boosts reasoning chain interpretability by 37% and significantly enhances classification accuracy. The approach provides an efficient, scalable, and fine-grained interpretability solution for multimodal large language models.
π Abstract
Multimodal Large Language Models (MLLMs) have shown promise in visual-textual reasoning, with Multimodal Chain-of-Thought (MCoT) prompting significantly enhancing interpretability. However, existing MCoT methods rely on rationale-rich datasets and largely focus on inter-object reasoning, overlooking the intra-object understanding crucial for image classification. To address this gap, we propose WISE, a Weak-supervision-guided Step-by-step Explanation method that augments any image classification dataset with MCoTs by reformulating the concept-based representations from Concept Bottleneck Models (CBMs) into concise, interpretable reasoning chains under weak supervision. Experiments across ten datasets show that our generated MCoTs not only improve interpretability by 37% but also lead to gains in classification accuracy when used to fine-tune MLLMs. Our work bridges concept-based interpretability and generative MCoT reasoning, providing a generalizable framework for enhancing MLLMs in fine-grained visual understanding.