Learn to Rank: Visual Attribution by Learning Importance Ranking

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual attribution methods struggle to simultaneously achieve efficiency, causal fidelity, and fine-grained interpretability. This work proposes a novel approach that directly optimizes deletion and insertion metrics as learning objectives by reformulating non-differentiable ranking operations into a differentiable permutation learning problem via Gumbel-Sinkhorn relaxation, enabling end-to-end pixel-level attribution training. By integrating attribution-guided perturbations with gradient refinement, the method consistently yields quantitative improvements across multiple models. Notably, on Vision Transformers, it produces sharper, boundary-aligned fine-grained explanation maps, effectively overcoming the longstanding trade-off between computational efficiency and explanatory granularity inherent in conventional approaches.
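The deletion metric that the summary says is optimized directly can be sketched as follows. This is a minimal single-channel NumPy illustration of the standard deletion protocol (progressively zero out the most important pixels and track model confidence), not the paper's implementation; the `model` callable, the zero baseline, and the step size are illustrative assumptions.

```python
import numpy as np

def deletion_auc(model, image, attribution, step=0.1, baseline=0.0):
    """Deletion metric: remove pixels in order of attributed importance
    and track the model's confidence; lower values mean the attribution
    identified truly influential pixels. `model` maps an image to a scalar
    confidence. Single-channel image assumed for simplicity."""
    order = np.argsort(-attribution.flatten())   # most important first
    n = order.size
    x = image.copy().reshape(-1)
    scores = [model(x.reshape(image.shape))]
    for frac in np.arange(step, 1.0 + 1e-9, step):
        k = int(frac * n)
        x[order[:k]] = baseline                  # delete the top-k pixels
        scores.append(model(x.reshape(image.shape)))
    # Average confidence over deletion steps (a simple AUC proxy)
    return float(np.mean(scores))
```

An attribution map that ranks pixels well should score lower than one that ranks them in reverse, since the model's confidence collapses sooner.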
📝 Abstract
Interpreting the decisions of complex computer vision models is crucial to establish trust and accountability, especially in safety-critical domains. An established approach to interpretability is generating visual attribution maps that highlight regions of the input most relevant to the model's prediction. However, existing methods face a three-way trade-off. Propagation-based approaches are efficient, but they can be biased and architecture-specific. Meanwhile, perturbation-based methods are causally grounded, yet they are expensive and for vision transformers often yield coarse, patch-level explanations. Learning-based explainers are fast but usually optimize surrogate objectives or distill from heuristic teachers. We propose a learning scheme that instead optimizes deletion and insertion metrics directly. Since these metrics depend on non-differentiable sorting and ranking, we frame them as permutation learning and replace the hard sorting with a differentiable relaxation using Gumbel-Sinkhorn. This enables end-to-end training through attribution-guided perturbations of the target model. During inference, our method produces dense, pixel-level attributions in a single forward pass with optional, few-step gradient refinement. Our experiments demonstrate consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models.
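The Gumbel-Sinkhorn relaxation of hard sorting described in the abstract can be sketched in NumPy. The rank-affinity construction (high scores attracted to early ranks), the temperature, and the iteration counts below are illustrative assumptions rather than the paper's actual formulation; as the temperature shrinks, the output approaches a hard permutation matrix while remaining differentiable in the scores.

```python
import numpy as np

def log_sinkhorn(log_alpha, n_iters=200):
    """Sinkhorn normalization in log space: alternately normalize rows and
    columns so exp(log_alpha) approaches a doubly stochastic matrix."""
    for _ in range(n_iters):
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=1, keepdims=True)
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=0, keepdims=True)
    return np.exp(log_alpha)

def gumbel_sinkhorn(scores, tau=0.05, n_iters=200, noise_scale=1.0, seed=0):
    """Soft permutation matrix from importance scores.
    scores: (n,) attribution scores; returns an (n, n) near-doubly-stochastic
    matrix whose row i places score i at its (soft) descending-sort rank."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    # Affinity of score i for rank j: higher scores prefer earlier ranks
    rank_weights = np.arange(n, 0, -1, dtype=float)        # n, n-1, ..., 1
    log_alpha = scores[:, None] * rank_weights[None, :]
    # Gumbel noise turns the relaxation into a reparameterized sample
    u = rng.uniform(size=(n, n))
    gumbel = noise_scale * (-np.log(-np.log(u + 1e-20) + 1e-20))
    return log_sinkhorn((log_alpha + gumbel) / tau, n_iters)
```

Because every step is differentiable, gradients of a deletion/insertion-style loss can flow through the soft permutation back into the attribution scores, which is the key property the paper exploits for end-to-end training.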
Problem

Research questions and friction points this paper is trying to address.

visual attribution
interpretability
learn to rank
vision transformers
attribution maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learn to Rank
Visual Attribution
Differentiable Sorting
Gumbel-Sinkhorn
Permutation Learning
David Schinagl
Institute of Visual Computing, Graz University of Technology
Christian Fruhwirth-Reisinger
Institute of Visual Computing, Graz University of Technology
Alexander Prutsch
Institute of Visual Computing, Graz University of Technology
Samuel Schulter
Amazon AGI
Computer Vision · Machine Learning
Horst Possegger
Senior Scientist, Graz University of Technology
Computer Vision · Machine Learning · Visual Perception · Pattern Recognition