🤖 AI Summary
Vision Transformers (ViTs) applied to hyperspectral imaging (HSI) suffer from weak interpretability: existing saliency methods fail to capture semantically relevant spectral cues, and full-spectrum ViT inference is computationally prohibitive. To address these problems, this paper proposes FOCUS, the first efficient spatial-spectral interpretability framework tailored to frozen ViTs. Its core innovations are (1) class-specific spectral prompting to guide band-level attention, and (2) a learnable [SINK] token that absorbs redundant attention. Together, these enable stable 3D saliency maps and spectral importance curves from a single forward pass, without backpropagation or backbone modification. Experiments demonstrate that FOCUS improves band-level IoU by 15%, reduces attention collapse by over 40%, aligns closely with expert annotations, and adds less than 1% extra parameters, significantly advancing the practical deployment of interpretable HSI-ViT models.
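The class-specific spectral prompting idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the group count, dimensions, and the `prompted_bands` helper are all hypothetical, and it assumes prompts are simply added to the embeddings of each band's wavelength group to bias attention toward class-relevant bands.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bands, n_groups, d, n_classes = 120, 6, 16, 3  # hypothetical sizes

band_emb = rng.normal(size=(n_bands, d))             # per-band embeddings
# learnable class-specific spectral prompts, one per wavelength group
prompts = rng.normal(size=(n_classes, n_groups, d))

# map each band to its contiguous wavelength group
group_id = np.repeat(np.arange(n_groups), n_bands // n_groups)

def prompted_bands(class_idx):
    # add the class's prompt for each band's wavelength group,
    # biasing downstream attention toward class-relevant bands
    return band_emb + prompts[class_idx, group_id]

out = prompted_bands(2)  # prompted embeddings for class 2
```

Because the backbone stays frozen, only the `prompts` tensor would be trained, which is consistent with the sub-1% parameter overhead the paper reports.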
📝 Abstract
Hyperspectral imaging (HSI) captures hundreds of narrow, contiguous wavelength bands, making it a powerful tool in biology, agriculture, and environmental monitoring. However, interpreting Vision Transformers (ViTs) in this setting remains largely unexplored due to two key challenges: (1) existing saliency methods struggle to capture meaningful spectral cues, often collapsing attention onto the class token, and (2) full-spectrum ViTs are computationally prohibitive for interpretability, given the high-dimensional nature of HSI data. We present FOCUS, the first framework that enables reliable and efficient spatial-spectral interpretability for frozen ViTs. FOCUS introduces two core components: class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention. Together, these designs make it possible to generate stable and interpretable 3D saliency maps and spectral importance curves in a single forward pass, without any gradient backpropagation or backbone modification. FOCUS improves band-level IoU by 15 percent, reduces attention collapse by over 40 percent, and produces saliency results that align closely with expert annotations. With less than 1 percent parameter overhead, our method makes high-resolution ViT interpretability practical for real-world hyperspectral applications, bridging a long-standing gap between black-box modeling and trustworthy HSI decision-making.
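The [SINK] token mechanism can also be sketched with plain numpy. This is a hedged illustration under stated assumptions: the sink is appended to the key sequence so that softmax attention can route redundant mass to it, and the attraction loss is assumed here to simply maximize the attention mass absorbed by the sink (the paper's exact loss is not specified in the abstract).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_patches = 16, 8                    # hypothetical sizes

patches = rng.normal(size=(n_patches, d))
sink = rng.normal(size=(1, d))          # learnable [SINK] token
query = rng.normal(size=(1, d))         # e.g. a class/query token

# sink appended to the token sequence; attention sums to 1 over patches + sink
keys = np.vstack([patches, sink])
attn = softmax(query @ keys.T / np.sqrt(d))   # shape (1, n_patches + 1)

sink_mass = attn[0, -1]                       # attention absorbed by the sink
attraction_loss = -np.log(sink_mass + 1e-8)   # hypothetical attraction term
```

With attention to noisy patches drained into the sink, the remaining per-patch (and per-band) attention can be read off directly in a single forward pass, which is what enables the gradient-free 3D saliency maps described above.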