🤖 AI Summary
This work addresses a limitation of existing explainable AI (XAI) methods: they predominantly focus on a single model at a single checkpoint and struggle to explain behavioral changes in large foundation models following interventions such as scaling, fine-tuning, reinforcement learning, or in-context learning. To overcome this, we propose a comparative XAI framework, Δ-XAI, that shifts the explanatory focus from analyzing isolated models to characterizing the difference between a reference model and an intervened model. We formulate desiderata for explaining behavioral shifts, introduce several possible comparative explanation pipelines, and relate these pipelines to the desiderata. A concrete Δ-XAI experiment illustrates how intervention-induced behavioral shifts can be explained comparatively.
📝 Abstract
Large-scale foundation models exhibit behavioral shifts: intervention-induced behavioral changes that appear after scaling, fine-tuning, reinforcement learning, or in-context learning. While the investigation of these phenomena has recently received attention, explaining why they appear is still overlooked. Classic explainable AI (XAI) methods can surface failures at a single checkpoint of a model, but they are structurally ill-suited to justify what changed internally across different checkpoints and which explanatory claims about that change are warranted. We take the position that behavioral shifts should be explained comparatively: the core target should be the intervention-induced shift between a reference model and an intervened model, rather than any single model in isolation. To this end, we formulate a Comparative XAI ($\Delta$-XAI) framework with a set of desiderata to be taken into account when designing suitable explanation methods. To illustrate how $\Delta$-XAI methods work, we introduce a set of possible pipelines, relate them to the desiderata, and provide a concrete $\Delta$-XAI experiment.