🤖 AI Summary
Transformer models for chest X-ray diagnosis often capture spurious correlations, leading to poor generalization and bias. To address this, we propose a hybrid interpretability-guided learning framework that integrates self-supervised attention alignment, which requires no manual annotations, with sparse expert-provided explanations, thereby substantially reducing reliance on costly human labels. Methodologically, building on the Vision Transformer (ViT) architecture, we design a class-discriminative attention mechanism and a hybrid interpretability constraint strategy that explicitly regularizes attention distributions to align with clinical priors. Experiments on multi-center chest X-ray classification demonstrate that our approach surpasses existing interpretability-guided methods in both classification accuracy and cross-dataset generalization. Moreover, the generated attention maps exhibit stronger alignment with clinical reasoning, while maintaining robustness and faithful interpretability.
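To make the idea of regularizing attention toward expert priors concrete, here is a minimal, hedged sketch. The KL-based formulation, function names, and shapes below are illustrative assumptions, not the loss actually used in the paper; a term like this would typically be added to the classification objective wherever an expert mask is available.

```python
import numpy as np

def attention_alignment_loss(attn, prior, eps=1e-8):
    """Illustrative alignment penalty: KL(prior || attn).

    attn:  model attention weights over image patches, shape (n_patches,)
    prior: expert-provided relevance mask over the same patches
           (e.g. 1 inside a clinician-marked region, 0 elsewhere)
    Both are normalized to probability distributions; the penalty grows
    when attention mass falls outside the expert-marked region.
    """
    attn = attn / (attn.sum() + eps)
    prior = prior / (prior.sum() + eps)
    return float(np.sum(prior * np.log((prior + eps) / (attn + eps))))

# Attention concentrated on the expert-marked patches incurs a smaller penalty.
prior = np.array([1.0, 1.0, 0.0, 0.0])        # expert marks the first two patches
aligned = np.array([0.45, 0.45, 0.05, 0.05])  # attention mostly inside the prior
misaligned = np.array([0.05, 0.05, 0.45, 0.45])
assert attention_alignment_loss(aligned, prior) < attention_alignment_loss(misaligned, prior)
```

In a hybrid setup such as the one described above, this human-guided term would apply only to the sparse subset of images with expert annotations, while a self-supervised consistency term covers the rest.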
📝 Abstract
Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.