🤖 AI Summary
Existing concept-based explanation methods typically model concepts as scalar attributes without pixel-level localizability, falling short of the interpretability requirements of high-stakes domains such as healthcare. To address this, the paper proposes ViConEx-Med, a transformer-based framework that jointly optimizes concept prediction and concept localization. The approach introduces multi-concept learnable tokens, processed by specialized attention layers for visual and text-based concept tokens, to generate concept-level localization maps. The architecture preserves high classification accuracy while substantially improving both concept detection and localization. Experiments on synthetic and real-world medical imaging datasets show that ViConEx-Med outperforms prior concept-based models and achieves concept localization performance competitive with black-box models, supporting a verifiable and traceable explanatory paradigm for high-assurance AI decision-making.
📝 Abstract
Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications and particularly in high-stakes scenarios, such as medical use cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at https://github.com/CristianoPatricio/viconex-med.
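The abstract does not give implementation details, but the core idea — learnable concept tokens that cross-attend over image patch features, yielding one localization map per concept plus a concept prediction — can be sketched as follows. This is a minimal single-head illustration with random weights, not the paper's actual architecture; all dimensions, weight names, and the linear concept head are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Illustrative sizes: 196 patch tokens from a 14x14 ViT grid,
# 5 clinical concepts, embedding dimension 64.
num_patches, num_concepts, dim = 196, 5, 64

patch_feats = rng.standard_normal((num_patches, dim))      # visual tokens from a backbone
concept_tokens = rng.standard_normal((num_concepts, dim))  # learnable concept tokens

# Single-head cross-attention: each concept token queries the patch features.
W_q = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_k = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_v = rng.standard_normal((dim, dim)) / np.sqrt(dim)

Q = concept_tokens @ W_q                    # (5, 64)
K = patch_feats @ W_k                       # (196, 64)
V = patch_feats @ W_v                       # (196, 64)

# One attention distribution over patches per concept: these rows,
# reshaped to the patch grid, act as concept-level localization maps.
attn = softmax(Q @ K.T / np.sqrt(dim))      # (5, 196)
concept_maps = attn.reshape(num_concepts, 14, 14)

# Attended concept representations feed a (hypothetical) linear head
# that scores each concept's presence.
updated = attn @ V                          # (5, 64)
w_head = rng.standard_normal(dim) / np.sqrt(dim)
concept_logits = updated @ w_head           # (5,)
```

In a trained model the concept tokens and attention weights would be learned end-to-end, so the same attention rows that drive concept prediction double as the visual explanation, which is the joint predict-and-localize behavior the abstract describes.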