🤖 AI Summary
Existing concept-based interpretability methods are largely confined to unimodal image settings and struggle to address the challenges of cross-modal semantic alignment and explanation in vision-language models. This work proposes CoCCA, a novel framework that introduces Canonical Correlation Analysis (CCA) into multimodal concept-level interpretability for the first time, revealing its intrinsic connection to the InfoNCE objective. Building upon this foundation, the authors further incorporate sparsity constraints to develop Sparse CoCCA (SCoCCA), which enhances concept disentanglement and discriminability. The proposed approach achieves state-of-the-art performance across concept discovery, reconstruction, and ablation tasks, significantly advancing the interpretability and semantic controllability of multimodal models.
📝 Abstract
Interpreting the internal reasoning of vision-language models is essential for deploying AI in safety-critical domains. Concept-based explainability provides a human-aligned lens by representing a model's behavior through semantically meaningful components. However, existing methods are largely restricted to images and overlook cross-modal interactions. Text-image embeddings, such as those produced by CLIP, suffer from a modality gap, where visual and textual features follow distinct distributions, limiting interpretability. Canonical Correlation Analysis (CCA) offers a principled way to align features from different distributions, but it has not been leveraged for multimodal concept-level analysis. We show that the objectives of CCA and InfoNCE are closely related, such that optimizing CCA implicitly optimizes InfoNCE, providing a simple, training-free mechanism to enhance cross-modal alignment without affecting the pre-trained InfoNCE objective. Motivated by this observation, we couple concept-based explainability with CCA, introducing Concept CCA (CoCCA), a framework that aligns cross-modal embeddings while enabling interpretable concept decomposition. We further extend this framework with Sparse Concept CCA (SCoCCA), which enforces sparsity to produce more disentangled and discriminative concepts, facilitating improved activation, ablation, and semantic manipulation. Our approach generalizes concept-based explanations to multimodal embeddings and achieves state-of-the-art performance in concept discovery, evidenced by reconstruction and manipulation tasks such as concept ablation.
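To make the alignment idea concrete, the following is a minimal NumPy sketch of classical CCA applied to synthetic paired embeddings that share a latent "concept" space but live in different distributions (mimicking a modality gap). This is an illustrative reconstruction of standard CCA, not the paper's CoCCA implementation; all variable names and data are hypothetical.

```python
import numpy as np

def cca_align(X, Y, k=8, eps=1e-6):
    """Classical CCA via whitening + SVD of the cross-covariance.

    X, Y: (n, d) paired embeddings (e.g., image and text features).
    Returns k-dim projections of both views and the canonical correlations.
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    # Regularized covariance / cross-covariance estimates
    Cxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Symmetric inverse square root for whitening
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]    # canonical directions for view X
    B = Wy @ Vt[:k].T    # canonical directions for view Y
    return Xc @ A, Yc @ B, s[:k]

# Synthetic paired views: a shared 8-dim latent, different random
# linear maps per "modality", plus a constant shift on Y (modality gap).
rng = np.random.default_rng(0)
n, d = 500, 32
z = rng.normal(size=(n, 8))                       # shared latent concepts
X = z @ rng.normal(size=(8, d)) + 0.1 * rng.normal(size=(n, d))
Y = z @ rng.normal(size=(8, d)) + 0.1 * rng.normal(size=(n, d)) + 5.0

Xp, Yp, corrs = cca_align(X, Y, k=8)
# Top canonical correlations recover the shared latent dimensions,
# so corrs[0] is close to 1 despite the shift between the two views.
```

Because CCA depends only on second-order statistics of the paired embeddings, this alignment can be computed post hoc on frozen CLIP-style features, which is the training-free property the abstract highlights.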