Position: Explaining Behavioral Shifts in Large Language Models Requires a Comparative Approach

📅 2026-02-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing explainable AI (XAI) methods, which predominantly analyze a single model at a single point in time and struggle to interpret behavioral changes in large language models following interventions such as fine-tuning or reinforcement learning. The authors propose a comparative XAI framework, Δ-XAI, that shifts the explanatory focus from isolated models to the differences between pre- and post-intervention models. They formulate desiderata for explaining behavioral shifts, outline several comparative explanation pipelines, and relate those pipelines to the desiderata. A concrete Δ-XAI experiment illustrates how the approach can be applied to an intervention-induced behavioral shift.

📝 Abstract

Large-scale foundation models exhibit behavioral shifts: intervention-induced behavioral changes that appear after scaling, fine-tuning, reinforcement learning, or in-context learning. While the investigation of these phenomena has recently received attention, explaining their appearance remains overlooked. Classic explainable AI (XAI) methods can surface failures at a single checkpoint of a model, but they are structurally ill-suited to establish what changed internally across different checkpoints and which explanatory claims are warranted about that change. We take the position that behavioral shifts should be explained comparatively: the core target should be the intervention-induced shift between a reference model and an intervened model, rather than any single model in isolation. To this end, we formulate a Comparative XAI ($\Delta$-XAI) framework with a set of desiderata to be taken into account when designing suitable explanation methods. To highlight how $\Delta$-XAI methods work, we introduce a set of possible pipelines, relate them to the desiderata, and provide a concrete $\Delta$-XAI experiment.
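
To make the comparative target concrete, the following is a minimal sketch and not one of the paper's pipelines. It assumes two toy linear models, a hypothetical reference (`w_ref`) and a hypothetical intervened one (`w_int`), and uses gradient-times-input as the attribution method. All names and numbers are illustrative assumptions. The point is that the object being explained is the per-feature difference in attribution, which for linear models sums exactly to the shift in output.

```python
from typing import List


def predict(weights: List[float], x: List[float]) -> float:
    """Output of a toy linear model: the dot product of weights and input."""
    return sum(w * xi for w, xi in zip(weights, x))


def attribution(weights: List[float], x: List[float]) -> List[float]:
    """Gradient-times-input attribution.

    For a linear model the gradient with respect to x is the weight vector,
    so each feature's attribution is simply w_i * x_i.
    """
    return [w * xi for w, xi in zip(weights, x)]


def delta_attribution(
    w_ref: List[float], w_int: List[float], x: List[float]
) -> List[float]:
    """Per-feature attribution of the shift between two models on one input."""
    a_ref = attribution(w_ref, x)
    a_int = attribution(w_int, x)
    return [ai - ar for ar, ai in zip(a_ref, a_int)]


# Hypothetical weights before and after an intervention (assumed values).
w_ref = [0.5, -1.0, 2.0]
w_int = [0.5, 1.0, 2.5]
x = [1.0, 2.0, 4.0]

# The behavioral shift on this input, and its decomposition over features.
shift = predict(w_int, x) - predict(w_ref, x)
delta = delta_attribution(w_ref, w_int, x)

print("output shift:", shift)
print("delta attribution per feature:", delta)
```

In this toy setting the first feature contributes nothing to the shift even though it carries a nonzero attribution in each model separately, which is the kind of distinction a single-checkpoint explanation cannot express. Real language models are nonlinear, so this exact additive decomposition should not be expected to carry over without further assumptions.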

Problem

Research questions and friction points this paper is trying to address.

behavioral shifts

explainable AI

large language models

comparative explanation

model interventions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative XAI

behavioral shifts

large language models