🤖 AI Summary
This work addresses the limitations of attribution maps for Vision Transformers (ViTs), which are often compromised by structured artifacts stemming from patch embeddings and attention mechanisms, yielding only coarse, unstable patch-level explanations. To overcome this, the authors propose a gradient decomposition method tailored to the architectural characteristics of ViTs, introducing, for the first time, distribution-aware modeling to mathematically disentangle the locally equivariant and stable components of the input–output mapping from structural noise. This approach effectively suppresses architecture-induced artifacts, substantially improving both the stability and spatial resolution of attributions and thereby producing high-fidelity, pixel-level explanation maps.
📝 Abstract
Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE (Distribution-aware Attribution via ViT Gradient DEcomposition), a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input–output mapping, separating them from architecture-induced artifacts and other sources of instability.
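To make the attribution setting concrete, the minimal sketch below computes a pixel-level input gradient for a standard ViT and averages it over Gaussian input perturbations, a SmoothGrad-style expectation that loosely illustrates what a distribution-aware, pixel-level gradient attribution looks like in practice. It is not the DAVE decomposition; the model choice (torchvision's ViT-B/16), the function name `expected_input_gradient`, and the noise parameters are assumptions made purely for illustration.

```python
# Illustrative sketch only: pixel-level input-gradient attribution for a ViT,
# averaged over an input distribution (SmoothGrad-style expectation).
# This is NOT the DAVE decomposition described in the abstract.
import torch
from torchvision.models import vit_b_16  # standard ViT-B/16 backbone

def expected_input_gradient(model, x, target_class, n_samples=16, sigma=0.1):
    """Average d(class score)/d(input) over Gaussian perturbations of the input.

    x: (1, 3, 224, 224) image tensor. Returns a (3, 224, 224) gradient map.
    """
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        # Sample from a simple local input distribution around x.
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, target_class]        # scalar class logit
        grad = torch.autograd.grad(score, noisy)[0]  # raw pixel-level gradient
        grads += grad
    return (grads / n_samples).squeeze(0)

if __name__ == "__main__":
    # weights=None keeps the sketch self-contained; pass
    # torchvision.models.ViT_B_16_Weights.DEFAULT for pretrained weights.
    model = vit_b_16(weights=None).eval()
    x = torch.randn(1, 3, 224, 224)                  # placeholder image
    with torch.no_grad():
        target = model(x).argmax(dim=1).item()
    attribution = expected_input_gradient(model, x, target)
    print(attribution.shape)                         # torch.Size([3, 224, 224])
```

A raw gradient map from a sketch like this typically still exhibits the patch-grid artifacts the abstract describes; DAVE's contribution is the structured decomposition that isolates the stable, locally equivariant components from such artifacts, which is not reproduced here.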