Manipulating Feature Visualizations with Gradient Slingshots

📅 2024-01-11
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
This work exposes a credibility vulnerability in feature visualization (FV) for deep neural network interpretability: FV outputs can be stealthily manipulated, leading to erroneous attribution of neuron semantics. The authors propose the first model-architecture-agnostic targeted FV manipulation method, combining gradient redirection (via Slingshot optimization), adversarial latent-space perturbations, and neuron-activation-constrained regularization to achieve "semantic masking": the seamless substitution of a target neuron's original FV explanation with an arbitrary, user-specified semantic concept. Experiments on CNNs and Vision Transformers demonstrate successful concealment of functionally critical neurons: model accuracy degrades by less than 0.3%, yet FV-based auditing yields a 92% false-negative rate in detecting manipulated neurons. These results underscore the fragility of prevailing FV techniques and motivate more robust model auditing.

📝 Abstract
Deep Neural Networks (DNNs) are capable of learning complex and versatile representations; however, the semantic nature of the learned concepts remains unknown. A common method used to explain the concepts learned by DNNs is Feature Visualization (FV), which generates a synthetic input signal that maximally activates a particular neuron in the network. In this paper, we investigate the vulnerability of this approach to adversarial model manipulations and introduce a novel method for manipulating FV without significantly impacting the model's decision-making process. The key distinction of our proposed approach is that it does not alter the model architecture. We evaluate the effectiveness of our method on several neural network models and demonstrate its capability to hide the functionality of arbitrarily chosen neurons by masking the original explanations of neurons with chosen target explanations during model auditing.
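As a minimal illustration of the activation-maximization idea behind FV (not the paper's implementation), the sketch below runs gradient ascent on a scalar input against a hypothetical "neuron" whose activation peaks at 2.0; the recovered input is the neuron's feature visualization:

```python
import math

def activation(x):
    # toy "neuron": responds most strongly to inputs near 2.0
    return math.exp(-(x - 2.0) ** 2)

def feature_visualization(act_fn, x0=0.0, lr=0.5, steps=300, eps=1e-4):
    """Activation maximization: gradient ascent on the input signal."""
    x = x0
    for _ in range(steps):
        # central-difference estimate of d(activation)/dx
        grad = (act_fn(x + eps) - act_fn(x - eps)) / (2 * eps)
        x += lr * grad
    return x

fv = feature_visualization(activation)  # converges to the preferred input, ~2.0
```

In a real network the input is an image tensor and the gradient comes from backpropagation, but the loop is the same: ascend the activation of one neuron with respect to the input.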
Problem

Research questions and friction points this paper is trying to address.

Manipulating Feature Visualizations without altering the model architecture
Exposing vulnerabilities in the trustworthiness of FV explanations
Proposing a defense against fabricated FV explanations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manipulates FV with the Gradient Slingshots method
Shapes FV optimization trajectories in off-distribution input regions
Proposes a defense against fabricated FV explanations
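The off-distribution shaping idea can be sketched on a toy 1D neuron. A second activation bump is planted at x = 5, far from the data manifold (inputs near 0), so on-distribution activations barely change; yet gradient ascent started from an off-distribution point is pulled to the attacker's target instead of the genuine feature. All function shapes and constants here are illustrative assumptions, not the paper's actual construction:

```python
import numpy as np

def neuron(x, w):
    # two Gaussian bumps: one at 0 (genuine feature), one at 5 (planted target)
    return w[0] * np.exp(-x ** 2 / 4) + w[1] * np.exp(-(x - 5.0) ** 2 / 4)

def feature_visualization(w, x0=3.0, lr=0.5, steps=300, eps=1e-4):
    # activation maximization via gradient ascent (central-difference gradient)
    x = x0
    for _ in range(steps):
        grad = (neuron(x + eps, w) - neuron(x - eps, w)) / (2 * eps)
        x += lr * grad
    return x

w_orig = np.array([1.0, 0.0])   # original model: only the genuine bump at 0
w_manip = np.array([1.0, 1.2])  # manipulated model: taller bump planted at 5

fv_orig = feature_visualization(w_orig)    # lands near 0 (genuine feature)
fv_manip = feature_visualization(w_manip)  # lands near 5 (planted concept)

# on-distribution behavior (inputs in [-1, 1]) is almost unchanged
data = np.linspace(-1.0, 1.0, 50)
drift = np.max(np.abs(neuron(data, w_manip) - neuron(data, w_orig)))
```

The gap between a tiny on-distribution `drift` and a completely different FV result is the core of the vulnerability: auditing the model on natural data would not reveal that its explanation has been replaced.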