WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLMs in Image Classification

πŸ“… 2025-09-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing multimodal chain-of-thought (MCoT) methods rely heavily on large-scale annotated reasoning chains and emphasize inter-object relational reasoning, neglecting the intra-object fine-grained understanding essential for image classification. This work proposes WISE, the first framework to transform the structured concept representations of concept bottleneck models (CBMs) into interpretable, concept-driven multimodal chains of thought under weak supervision. WISE enhances reasoning end to end without manually annotated reasoning traces: only image-class labels are needed to generate stepwise, concept-based explanations. By integrating CBMs, weakly supervised learning, and MCoT generation, WISE achieves consistent improvements across ten benchmark datasets, improving reasoning-chain interpretability by 37% and raising classification accuracy when the generated chains are used to fine-tune MLLMs. The approach offers an efficient, scalable, and fine-grained interpretability solution for multimodal large language models.

πŸ“ Abstract
Multimodal Large Language Models (MLLMs) have shown promise in visual-textual reasoning, with Multimodal Chain-of-Thought (MCoT) prompting significantly enhancing interpretability. However, existing MCoT methods rely on rationale-rich datasets and largely focus on inter-object reasoning, overlooking the intra-object understanding crucial for image classification. To address this gap, we propose WISE, a Weak-supervision-guided Step-by-step Explanation method that augments any image classification dataset with MCoTs by reformulating the concept-based representations from Concept Bottleneck Models (CBMs) into concise, interpretable reasoning chains under weak supervision. Experiments across ten datasets show that our generated MCoTs not only improve interpretability by 37% but also lead to gains in classification accuracy when used to fine-tune MLLMs. Our work bridges concept-based interpretability and generative MCoT reasoning, providing a generalizable framework for enhancing MLLMs in fine-grained visual understanding.
Problem

Research questions and friction points this paper is trying to address.

Existing MCoT methods overlook intra-object understanding in image classification
Current approaches rely on rationale-rich annotated datasets, limiting their applicability
There is a gap between concept-based interpretability and generative MCoT reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multimodal reasoning chains from concept bottleneck models (CBMs)
Uses weak supervision, requiring only image-class labels, to create interpretable step-by-step explanations
Reformulates concept bottleneck representations into concise reasoning chains (see the sketch below this list)
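As a rough illustration of the idea summarized above, the sketch below shows how per-image concept activations from a CBM might be reformulated into a concise, stepwise explanation using only the image-class label as supervision. The concept names, scoring threshold, and output template are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch (not the authors' code) of turning CBM-style concept scores
# into a step-by-step explanation under weak supervision: only the image-class
# label is assumed; concept names, threshold, and template are hypothetical.

from typing import Dict, List, Tuple


def build_mcot(
    concept_scores: Dict[str, float],  # concept name -> CBM activation/score
    class_label: str,                  # weak supervision: image-class label only
    top_k: int = 3,
    threshold: float = 0.5,
) -> str:
    """Select the most active concepts and phrase them as a concise reasoning chain."""
    active: List[Tuple[str, float]] = sorted(
        ((c, s) for c, s in concept_scores.items() if s >= threshold),
        key=lambda cs: cs[1],
        reverse=True,
    )[:top_k]

    steps = [
        f"Step {i + 1}: the image shows {concept} (score {score:.2f})."
        for i, (concept, score) in enumerate(active)
    ]
    steps.append(f"Therefore, the image is classified as '{class_label}'.")
    return "\n".join(steps)


# Example: a bird-classification image whose concept scores are invented for illustration.
scores = {"red plumage": 0.92, "conical beak": 0.81, "black wings": 0.64, "webbed feet": 0.08}
print(build_mcot(scores, class_label="Northern Cardinal"))
```

In a pipeline like the one the abstract describes, chains generated this way would then serve as weakly supervised rationales for fine-tuning an MLLM.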
πŸ”Ž Similar Papers
No similar papers found.