Learn to Rank: Visual Attribution by Learning Importance Ranking

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual attribution methods struggle to simultaneously achieve efficiency, causal fidelity, and fine-grained interpretability. This work proposes a novel approach that directly optimizes deletion and insertion metrics as learning objectives by reformulating non-differentiable ranking operations into a differentiable permutation learning problem via Gumbel-Sinkhorn relaxation, enabling end-to-end pixel-level attribution training. By integrating attribution-guided perturbations with gradient refinement, the method consistently yields quantitative improvements across multiple models. Notably, on Vision Transformers, it produces sharper, boundary-aligned fine-grained explanation maps, effectively overcoming the longstanding trade-off between computational efficiency and explanatory granularity inherent in conventional approaches.
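The deletion metric that the summary says is optimized directly can be sketched as follows. This is a minimal single-channel NumPy illustration of the standard deletion protocol (progressively zero out the most important pixels and track model confidence), not the paper's implementation; the `model` callable, the zero baseline, and the step size are illustrative assumptions.

```python
import numpy as np

def deletion_auc(model, image, attribution, step=0.1, baseline=0.0):
    """Deletion metric: remove pixels in order of attributed importance
    and track the model's confidence; lower values mean the attribution
    identified truly influential pixels. `model` maps an image to a scalar
    confidence. Single-channel image assumed for simplicity."""
    order = np.argsort(-attribution.flatten())   # most important first
    n = order.size
    x = image.copy().reshape(-1)
    scores = [model(x.reshape(image.shape))]
    for frac in np.arange(step, 1.0 + 1e-9, step):
        k = int(frac * n)
        x[order[:k]] = baseline                  # delete the top-k pixels
        scores.append(model(x.reshape(image.shape)))
    # Average confidence over deletion steps (a simple AUC proxy)
    return float(np.mean(scores))
```

An attribution map that ranks pixels well should score lower than one that ranks them in reverse, since the model's confidence collapses sooner.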
📝 Abstract
Interpreting the decisions of complex computer vision models is crucial to establish trust and accountability, especially in safety-critical domains. An established approach to interpretability is generating visual attribution maps that highlight regions of the input most relevant to the model's prediction. However, existing methods face a three-way trade-off. Propagation-based approaches are efficient, but they can be biased and architecture-specific. Meanwhile, perturbation-based methods are causally grounded, yet they are expensive and for vision transformers often yield coarse, patch-level explanations. Learning-based explainers are fast but usually optimize surrogate objectives or distill from heuristic teachers. We propose a learning scheme that instead optimizes deletion and insertion metrics directly. Since these metrics depend on non-differentiable sorting and ranking, we frame them as permutation learning and replace the hard sorting with a differentiable relaxation using Gumbel-Sinkhorn. This enables end-to-end training through attribution-guided perturbations of the target model. During inference, our method produces dense, pixel-level attributions in a single forward pass with optional, few-step gradient refinement. Our experiments demonstrate consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models.
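The Gumbel-Sinkhorn relaxation of hard sorting described in the abstract can be sketched in NumPy. The rank-affinity construction (high scores attracted to early ranks), the temperature, and the iteration counts below are illustrative assumptions rather than the paper's actual formulation; as the temperature shrinks, the output approaches a hard permutation matrix while remaining differentiable in the scores.

```python
import numpy as np

def log_sinkhorn(log_alpha, n_iters=200):
    """Sinkhorn normalization in log space: alternately normalize rows and
    columns so exp(log_alpha) approaches a doubly stochastic matrix."""
    for _ in range(n_iters):
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=1, keepdims=True)
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=0, keepdims=True)
    return np.exp(log_alpha)

def gumbel_sinkhorn(scores, tau=0.05, n_iters=200, noise_scale=1.0, seed=0):
    """Soft permutation matrix from importance scores.
    scores: (n,) attribution scores; returns an (n, n) near-doubly-stochastic
    matrix whose row i places score i at its (soft) descending-sort rank."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    # Affinity of score i for rank j: higher scores prefer earlier ranks
    rank_weights = np.arange(n, 0, -1, dtype=float)        # n, n-1, ..., 1
    log_alpha = scores[:, None] * rank_weights[None, :]
    # Gumbel noise turns the relaxation into a reparameterized sample
    u = rng.uniform(size=(n, n))
    gumbel = noise_scale * (-np.log(-np.log(u + 1e-20) + 1e-20))
    return log_sinkhorn((log_alpha + gumbel) / tau, n_iters)
```

Because every step is differentiable, gradients of a deletion/insertion-style loss can flow through the soft permutation back into the attribution scores, which is the key property the paper exploits for end-to-end training.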
Problem

Research questions and friction points this paper is trying to address.

visual attribution
interpretability
learn to rank
vision transformers
attribution maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learn to Rank
Visual Attribution
Differentiable Sorting
Gumbel-Sinkhorn
Permutation Learning
David Schinagl
Institute of Visual Computing, Graz University of Technology
Christian Fruhwirth-Reisinger
Institute of Visual Computing, Graz University of Technology
Alexander Prutsch
Institute of Visual Computing, Graz University of Technology
Samuel Schulter
Amazon AGI
Computer Vision · Machine Learning
Horst Possegger
Senior Scientist, Graz University of Technology
Computer Vision · Machine Learning · Visual Perception · Pattern Recognition