Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical large vision-language models (Med-LVLMs) frequently hallucinate or produce inaccurate outputs because their visual attention is poorly aligned with clinically relevant regions. To address this, the paper proposes A³Tune, an automatic attention alignment fine-tuning framework. A³Tune generates zero-shot weak segmentation labels with SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively fine-tunes visually-critical attention heads so that attention shifts toward the labeled regions while minimizing interference with the rest of the model. A complementary A³MoE module adaptively selects tuning parameters for each prompt-image pair. Evaluated on medical visual question answering and radiology report generation benchmarks, A³Tune outperforms state-of-the-art baselines, and its learned attention distributions align better with clinical anatomical regions, yielding gains in both quantitative accuracy and clinical plausibility.

📝 Abstract
Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A$^3$Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A$^3$Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A$^3$MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A$^3$Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
Problem

Research questions and friction points this paper is trying to address.

Improving attention distribution in medical vision-language models
Reducing hallucinated or inaccurate outputs in Med-LVLMs
Enhancing alignment without additional supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic Attention Alignment Tuning framework
Zero-shot weak labels from SAM
Adaptive parameter selection via A$^3$MoE
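The core idea behind the Innovation bullets — steering visually-critical attention heads toward weakly labeled image regions — can be sketched as a simple auxiliary loss. This is an illustrative approximation only, not the paper's actual A$^3$Tune objective; the function name, tensor shapes, and loss form are assumptions.

```python
import torch

def attention_alignment_loss(attn, mask, eps=1e-8):
    """Penalize attention heads that place little mass on a weakly
    labeled image region (e.g. a SAM-derived segmentation mask).

    Illustrative sketch only -- the paper's actual objective, head
    selection, and BioMedCLIP label refinement are more involved.

    attn: (num_heads, num_patches) attention weights, rows sum to 1
    mask: (num_patches,) binary weak label, 1 = region of interest
    """
    mass = (attn * mask).sum(dim=-1)   # per-head attention mass on the region
    return -(mass + eps).log().mean()  # low mass on the region -> high loss
```

Minimizing such a loss alongside the task loss (restricted, per the abstract, to a selected subset of visually-critical heads) would push attention mass onto the labeled region without retraining the whole model.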