Exemplar Partitioning for Mechanistic Interpretability

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes Exemplar Partitioning (EP), an unsupervised method for building interpretable, computationally lightweight feature dictionaries for analyzing the internal mechanisms of large language models. EP leader-clusters streamed activations under a distance threshold and uses the retained real samples as anchors of Voronoi regions, so the dictionary size does not have to be fixed in advance. Each observed exemplar serves as both the membership criterion for its region and an intervention direction. Because exemplars are observed rather than learned, dictionaries built from the same data stream can be compared directly across layers, models, and training checkpoints, and nearest-exemplar distance gives an out-of-distribution signal at no extra cost. On AxBench latent concept detection at Gemma-2-2B-it L20, EP reaches a mean AUROC of 0.881, which is 0.126 above the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE-A's 0.911, at roughly one-thousandth of the build compute. EP one-hot probes retain about 97% of raw-activation probe accuracy, and about 20% of EP regions match an SAE feature at $F_{1} > 0.5$.

📝 Abstract

We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^{3}\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim 20\%$ of EP regions match an SAE feature at $F_{1} > 0.5$, and EP one-hot probes retain $\sim 97\%$ of raw-activation probe accuracy at $\ell_{0} = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_{1}$ reaches mean AUROC $0.881$, $+0.126$ over the canonical GemmaScope SAE leaderboard entry and within $0.030$ of SAE-A's $0.911$, at $\sim 10^{3}\times$ less build compute.
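
The construction described in the abstract (leader clustering of streamed activations within a distance threshold, followed by nearest-exemplar assignment) can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean metric, the function names `build_exemplars` and `assign`, and the toy 2-D data are all assumptions made for illustration, and the abstract does not state which distance EP actually uses.

```python
import numpy as np

def build_exemplars(stream, threshold):
    """Leader clustering over a non-empty stream of activation vectors.

    An activation is kept as a new exemplar only if it lies farther than
    `threshold` from every exemplar kept so far, so the number of exemplars
    is determined by the data and the threshold, not fixed in advance.
    Euclidean distance is an assumption of this sketch.
    """
    exemplars = []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if exemplars:
            dists = np.linalg.norm(np.stack(exemplars) - x, axis=1)
            if dists.min() <= threshold:
                continue  # x falls within reach of an existing exemplar
        exemplars.append(x)
    return np.stack(exemplars)

def assign(x, exemplars):
    """Return (region index, distance) for the nearest exemplar.

    The index is the Voronoi region of x; the distance can be read as a
    simple out-of-distribution score (larger means less familiar).
    """
    dists = np.linalg.norm(exemplars - np.asarray(x, dtype=float), axis=1)
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

# Toy 2-D "activations" (hypothetical data, for illustration only).
stream = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2], [0.0, 0.05]]
exemplars = build_exemplars(stream, threshold=1.0)

region, dist = assign([0.2, 0.1], exemplars)          # close to an exemplar
far_region, far_dist = assign([10.0, 10.0], exemplars)  # far from all exemplars
```

In this toy run only two of the five points are retained as exemplars, and the query far from both receives a nearest-exemplar distance well above the build threshold, which is the kind of signal the abstract describes as a free out-of-distribution indicator. A one-hot encoding of the region index would give the $\ell_{0} = 1$ representation used for probing.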

Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability

feature dictionary

activation space

large language models

unsupervised interpretation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Exemplar Partitioning

Mechanistic Interpretability

Voronoi Partition