Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Attribution methods in AI interpretability are studied largely in isolation, producing a fragmented landscape of approaches and terminology. Method: This position paper presents a unified analytical view spanning three attribution paradigms: feature-level, data-level, and model-component-level attribution. By analyzing successful methods in each domain, it shows that they rely on a shared set of core techniques, namely perturbations, gradient backpropagation, and linear approximations (e.g., first-order Taylor expansions), and that they differ primarily in perspective rather than in underlying mechanism: all model local sensitivity of behavior to a change. Contribution/Results: The unified view aligns terminology and conceptual definitions across the three paradigms and suggests shared evaluation criteria, improving the transferability and reusability of attribution methods. It lowers the entry barrier for newcomers and highlights directions for broader AI research, including model editing, controllable steering, and regulation.

📝 Abstract
The increasing complexity of AI systems has made understanding their behavior a critical challenge. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of approaches and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods across three domains and present a unified view to demonstrate that these seemingly distinct methods employ similar approaches, such as perturbations, gradients, and linear approximations, differing primarily in their perspectives rather than core techniques. Our unified perspective enhances understanding of existing attribution methods, identifies shared concepts and challenges, makes this field more accessible to newcomers, and highlights new directions not only for attribution and interpretability but also for broader AI research, including model editing, steering, and regulation.
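The abstract's central claim, that perturbation-based and gradient-based attribution share a local-sensitivity mechanism via linear approximation, can be illustrated with a minimal sketch. The toy model `f`, its weights, and the perturbation size below are illustrative assumptions, not from the paper: for a small perturbation of one input feature, the measured output change (perturbation attribution) matches the change predicted by the gradient (first-order Taylor term).

```python
import numpy as np

# Toy differentiable model (an assumption for illustration): f(x) = tanh(w . x)
def f(x, w):
    return float(np.tanh(w @ x))

def grad_f(x, w):
    # Analytic gradient: d/dx tanh(w . x) = (1 - tanh(w . x)^2) * w
    return (1.0 - np.tanh(w @ x) ** 2) * w

w = np.array([0.5, -1.2, 0.8])   # hypothetical model weights
x = np.array([1.0, 0.3, -0.7])   # hypothetical input
eps = 1e-4                       # small perturbation size

# Perturbation attribution: observed output change when feature i is nudged by eps
pert = np.array([f(x + eps * np.eye(3)[i], w) - f(x, w) for i in range(3)])

# Gradient attribution: first-order Taylor expansion predicts the same change
grad = grad_f(x, w) * eps

# For small eps the two attributions agree up to O(eps^2)
print(np.allclose(pert, grad, atol=1e-6))
```

For larger perturbations the two diverge by the higher-order Taylor terms, which is one way the paper's "differing perspectives, shared technique" framing can be read: the methods estimate the same local sensitivity at different scales.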
Problem

Research questions and friction points this paper is trying to address.

AI Decision Understanding
Unified Method
Simplification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Perspective
Explainable AI
Integrated Methods