🤖 AI Summary
Existing concept-based explanation methods typically model concepts as scalar attributes without pixel-level localizability, falling short of the interpretability requirements of high-stakes domains such as healthcare. To address this, the paper proposes ViConEx-Med, a transformer-based framework that jointly optimizes concept prediction and concept localization. The approach introduces multi-concept learnable tokens, processed by specialized attention layers for visual and text-based concept tokens, to generate concept-level localization maps. The architecture preserves high classification accuracy while substantially improving both concept detection and localization. Experiments on synthetic and real-world medical imaging datasets show that ViConEx-Med outperforms prior concept-based models and achieves concept localization performance competitive with black-box models, supporting a verifiable and traceable explanatory paradigm for high-assurance AI decision-making.
📝 Abstract
Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications and particularly in high-stakes scenarios, such as medical use cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at https://github.com/CristianoPatricio/viconex-med.
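The abstract does not give implementation details, but the core idea — learnable concept tokens that cross-attend over image patch features, yielding one localization map per concept plus a concept prediction — can be sketched as follows. This is a minimal single-head illustration with random weights, not the paper's actual architecture; all dimensions, weight names, and the linear concept head are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Illustrative sizes: 196 patch tokens from a 14x14 ViT grid,
# 5 clinical concepts, embedding dimension 64.
num_patches, num_concepts, dim = 196, 5, 64

patch_feats = rng.standard_normal((num_patches, dim))      # visual tokens from a backbone
concept_tokens = rng.standard_normal((num_concepts, dim))  # learnable concept tokens

# Single-head cross-attention: each concept token queries the patch features.
W_q = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_k = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_v = rng.standard_normal((dim, dim)) / np.sqrt(dim)

Q = concept_tokens @ W_q                    # (5, 64)
K = patch_feats @ W_k                       # (196, 64)
V = patch_feats @ W_v                       # (196, 64)

# One attention distribution over patches per concept: these rows,
# reshaped to the patch grid, act as concept-level localization maps.
attn = softmax(Q @ K.T / np.sqrt(dim))      # (5, 196)
concept_maps = attn.reshape(num_concepts, 14, 14)

# Attended concept representations feed a (hypothetical) linear head
# that scores each concept's presence.
updated = attn @ V                          # (5, 64)
w_head = rng.standard_normal(dim) / np.sqrt(dim)
concept_logits = updated @ w_head           # (5,)
```

In a trained model the concept tokens and attention weights would be learned end-to-end, so the same attention rows that drive concept prediction double as the visual explanation, which is the joint predict-and-localize behavior the abstract describes.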