AI Summary
This work addresses the problem of comparing and transferring behavioral directions across distinct large language model (LLM) families, whose hidden dimensions, tokenizers, and training procedures differ. The authors propose an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions from source models are averaged there into a canonical direction, which is then reconstructed in the native representation space of a target model, without fine-tuning or target-specific direction extraction. The results indicate a common, transferable behavioral structure across several LLM families at the representation level. With only two source models and a small anchor pool, the method already approximates transferable directions: on the Llama-Qwen-Mistral-Phi (LQMP) cluster, held-out targets reach 83% accuracy in ten-way detection and a mean binary AUROC of 0.95, and canonical steering shifts refusal rates by up to +0.46% under distribution shift.
Abstract
Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
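The project, average, and reconstruct pipeline described above can be illustrated with a small NumPy sketch. The abstract does not give the exact projection or reconstruction operators, so everything here is an assumption for illustration: ACS coordinates are taken to be inner products between a direction and each anchor activation, reconstruction is a least-squares pseudo-inverse, and the "models" are synthetic linear maps from a shared latent space. All function and variable names are hypothetical and do not come from the paper.

```python
import numpy as np


def to_acs(anchor_acts, direction):
    """Assumed ACS map: inner product of a native direction with each anchor."""
    coords = anchor_acts @ direction  # shape: (n_anchors,)
    return coords / np.linalg.norm(coords)


def canonical_direction(acs_dirs):
    """Average the source-model ACS directions and renormalize."""
    mean = np.mean(acs_dirs, axis=0)
    return mean / np.linalg.norm(mean)


def reconstruct(anchor_acts, acs_dir):
    """Assumed inverse map: minimum-norm least-squares solution in native space."""
    native = np.linalg.pinv(anchor_acts) @ acs_dir  # shape: (hidden_dim,)
    return native / np.linalg.norm(native)


def make_toy_model(rng, latent_dim, hidden_dim):
    """Synthetic 'model': a (latent_dim, hidden_dim) map with orthonormal rows."""
    q, _ = np.linalg.qr(rng.standard_normal((hidden_dim, latent_dim)))
    return q.T


rng = np.random.default_rng(0)
latent_dim, n_anchors = 8, 32

# Shared anchor inputs and one behavioral direction, both in the toy latent space.
anchors_latent = rng.standard_normal((n_anchors, latent_dim))
behavior_latent = rng.standard_normal(latent_dim)

# Three toy models with different hidden sizes (mimicking different families).
hidden_dims = {"source_a": 64, "source_b": 96, "target": 48}
models = {name: make_toy_model(rng, latent_dim, d) for name, d in hidden_dims.items()}
anchor_acts = {name: anchors_latent @ w for name, w in models.items()}
native_dirs = {name: behavior_latent @ w for name, w in models.items()}

# Project the two source directions into ACS and average them.
acs_dirs = [to_acs(anchor_acts[n], native_dirs[n]) for n in ("source_a", "source_b")]
canonical = canonical_direction(acs_dirs)

# Reconstruct in the target's native space using only its anchor activations.
recovered = reconstruct(anchor_acts["target"], canonical)

# Compare against the target's own direction, which the pipeline never saw.
true_target = native_dirs["target"] / np.linalg.norm(native_dirs["target"])
cosine = float(recovered @ true_target)
print(f"cosine(recovered, true target direction) = {cosine:.4f}")
```

In this idealized setting the recovered direction should align almost exactly with the target's own direction, because every toy model is an exact linear image of the same latent space. Real hidden states are neither linear nor noise-free, so the sketch only shows why anchor activations alone can carry enough information to move a direction between spaces of different dimension.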