Are Vision Transformer Representations Semantically Meaningful? A Case Study in Medical Imaging

📅 2025-07-02

📈 Citations: 0

✨ Influential: 0

career value

164K/year

🤖 AI Summary

This study reveals a critical semantic inconsistency in vision transformers (ViTs) for medical image classification: minimal input perturbations induce substantial representation shifts in latent space, severely degrading classification reliability and posing safety risks for clinical deployment. To address this, we propose the first systematic representation stability analysis framework based on projected gradient descent (PGD), enabling quantitative evaluation of semantic consistency and classification robustness in ViT hidden representations. Experiments demonstrate that under only 0.5% L∞-bounded adversarial perturbations, state-of-the-art ViTs suffer an average accuracy drop exceeding 60%; moreover, semantically distinct medical images across classes collapse to highly similar latent vectors (cosine similarity > 0.98). Our work not only identifies a fundamental semantic fragility of ViTs in healthcare contexts but also provides a reproducible diagnostic toolkit and standardized benchmark—laying essential groundwork for the design, evaluation, and certification of safe, trustworthy AI models in medical imaging.

Technology Category

Application Category

📝 Abstract

Vision transformers (ViTs) have rapidly gained prominence in medical imaging tasks such as disease classification, segmentation, and detection due to their superior accuracy compared to conventional deep learning models. However, due to their size and complex interactions via the self-attention mechanism, they are not well understood. In particular, it is unclear whether the representations produced by such models are semantically meaningful. In this paper, using a projected gradient-based algorithm, we show that their representations are not semantically meaningful and they are inherently vulnerable to small changes. Images with imperceptible differences can have very different representations; on the other hand, images that should belong to different semantic classes can have nearly identical representations. Such vulnerability can lead to unreliable classification results; for example, unnoticeable changes cause the classification accuracy to be reduced by over 60%. %. To the best of our knowledge, this is the first work to systematically demonstrate this fundamental lack of semantic meaningfulness in ViT representations for medical image classification, revealing a critical challenge for their deployment in safety-critical systems.

Problem

Research questions and friction points this paper is trying to address.

Assessing semantic meaningfulness of Vision Transformer representations

Identifying vulnerability to small changes in medical images

Evaluating reliability of ViT-based medical image classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Projected gradient algorithm analyzes ViT representations

Reveals ViT vulnerability to small image changes

First systematic study on ViT semantic meaningfulness

🔎 Similar Papers

No similar papers found.