Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

📅 2026-02-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing methods struggle to compare internal representations across large language models with different architectures, which hinders the discovery of potentially safety-critical behaviors in newly released models. This work extends the crosscoder framework to unsupervised cross-architecture model diffing and introduces the Dedicated Feature Crosscoder (DFC), an architectural modification designed to better isolate features unique to one model. Applied to models including Qwen3-8B, Llama3.1-8B-Instruct, and GPT-OSS-20B, the method surfaces meaningful behavioral differences without supervision, including Chinese Communist Party alignment, American exceptionalism, and a copyright refusal mechanism. The results support cross-architecture crosscoder diffing as a way to identify such differences between models.

📝 Abstract
Model diffing, the process of comparing models' internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far been primarily focused on comparing a base model with its finetune. Since new LLM releases are often novel architectures, cross-architecture methods are essential to make model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing but have only ever been applied to base vs finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we find in an unsupervised fashion features including Chinese Communist Party alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B. Together, our results work towards establishing cross-architecture crosscoder model diffing as an effective method for identifying meaningful behavioral differences between AI models.
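
The abstract does not spell out how a DFC is built, so the following is only a rough sketch of one plausible reading: a crosscoder with a single latent dictionary shared by two models, where some latents are reserved for model A, some for model B, and the rest are shared. The partition sizes, the masking of both encoder and decoder weights, the ReLU activation, and every function and parameter name here are assumptions made for illustration, not the authors' implementation. Note that the two models may have different hidden sizes, which is what makes the setup cross-architecture.

```python
import numpy as np

def init_crosscoder(d_a, d_b, n_shared, n_only_a, n_only_b, seed=0):
    """Create toy parameters plus masks for a partitioned latent dictionary.

    Latent layout (an assumption): [shared | A-only | B-only].
    """
    rng = np.random.default_rng(seed)
    n = n_shared + n_only_a + n_only_b
    params = {
        "enc_a": rng.normal(0.0, 0.1, (d_a, n)),  # model A activations -> latents
        "enc_b": rng.normal(0.0, 0.1, (d_b, n)),  # model B activations -> latents
        "dec_a": rng.normal(0.0, 0.1, (n, d_a)),  # latents -> model A activations
        "dec_b": rng.normal(0.0, 0.1, (n, d_b)),  # latents -> model B activations
        "bias": np.zeros(n),
    }
    # mask_x[i] == 1 means latent i may read from and write to model x.
    mask_a = np.ones(n)
    mask_b = np.ones(n)
    mask_a[n_shared + n_only_a:] = 0.0           # B-only latents never touch A
    mask_b[n_shared:n_shared + n_only_a] = 0.0   # A-only latents never touch B
    return params, mask_a, mask_b

def forward(params, mask_a, mask_b, x_a, x_b):
    """Encode paired activations into one latent vector, then decode per model."""
    pre = (
        x_a @ (params["enc_a"] * mask_a)   # mask broadcasts over latent columns
        + x_b @ (params["enc_b"] * mask_b)
        + params["bias"]
    )
    z = np.maximum(pre, 0.0)               # ReLU latents (assumed activation)
    recon_a = (z * mask_a) @ params["dec_a"]
    recon_b = (z * mask_b) @ params["dec_b"]
    return z, recon_a, recon_b

def reconstruction_loss(x_a, x_b, recon_a, recon_b):
    """Sum of per-model mean squared reconstruction errors."""
    return np.mean((x_a - recon_a) ** 2) + np.mean((x_b - recon_b) ** 2)
```

In this sketch, a latent in the A-only block can only be driven by, and can only explain, model A's activations, so a feature that ends up there is a candidate for a behavior specific to model A. A real training loop would also need a sparsity penalty and an optimizer, both omitted here.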

Problem

Research questions and friction points this paper is trying to address.

model diffing
cross-architecture
large language models
unsupervised discovery
behavioral differences

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-architecture model diffing
crosscoders
Dedicated Feature Crosscoders
unsupervised feature discovery
LLM behavioral analysis