AI Summary
This study investigates the transferability and universality of concept representations across large language models (LLMs).
Method: Inspired by Platonic idealism, we propose a cross-model concept alignment framework that models concept representations in the latent spaces of different LLMs via linear mappings, enabling extraction, alignment, and reuse of steering vectors across models.
Contribution/Results: We empirically establish, for the first time, the strong linear alignability of concept representations across LLMs and uncover a "weak-to-strong" transfer principle: steering vectors extracted from smaller models effectively control larger models' behavior. Our method demonstrates significant improvements over baselines in alignment accuracy, cross-model behavioral controllability, and safe, controllable text generation across multiple mainstream LLMs. These findings introduce a novel paradigm for inter-model knowledge transfer, lightweight intervention, and controllable AI.
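The steering vectors that get transferred are derived from a model's hidden activations. A common recipe, assumed here purely for illustration and not necessarily the paper's exact extraction procedure, is the difference of mean last-token activations between prompts that express a concept and prompts that do not:

```python
# Minimal sketch: extract a concept steering vector as a difference of mean
# hidden activations (an assumed, standard recipe; the paper's own extraction
# method may differ). Layer choice is illustrative.
import torch


def hidden_states(model, tokenizer, prompts, layer):
    """Return the last-token hidden state at `layer` for each prompt."""
    states = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])  # shape: (hidden_dim,)
    return torch.stack(states)


def extract_steering_vector(model, tokenizer, concept_prompts, neutral_prompts, layer):
    """Concept direction = mean(concept activations) - mean(neutral activations)."""
    pos = hidden_states(model, tokenizer, concept_prompts, layer)
    neg = hidden_states(model, tokenizer, neutral_prompts, layer)
    return pos.mean(dim=0) - neg.mean(dim=0)
```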
Abstract
Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
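Finding 1 suggests a concrete pipeline: collect paired hidden states from two models on the same prompts, fit a linear map between their latent spaces, push a source-model steering vector through that map, and add the result to the target model's residual stream during generation. The sketch below is a minimal illustration under assumed choices (a least-squares fit for the linear map and a Llama-style `model.model.layers[layer]` module path); it is not the paper's stated configuration.

```python
# Minimal sketch of cross-model steering-vector transfer via a linear map.
# Assumptions (not from the paper): least-squares fit, Llama-style layer path,
# additive intervention with a scalar strength.
import torch


def fit_linear_map(source_acts, target_acts):
    """Least-squares W such that source_acts @ W ~= target_acts.
    source_acts: (n, d_src) and target_acts: (n, d_tgt), hidden states of the
    two models on the same n prompts."""
    return torch.linalg.lstsq(source_acts, target_acts).solution  # (d_src, d_tgt)


def transfer_steering_vector(sv_source, W):
    """Map a steering vector from the source model's space into the target model's."""
    return sv_source @ W


def add_steering_hook(target_model, layer, steering_vector, scale=1.0):
    """Add the transferred vector to the target layer's output during generation.
    Assumes a Llama-style module layout: target_model.model.layers[layer]."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return target_model.model.layers[layer].register_forward_hook(hook)
```

A weak-to-strong transfer experiment in this style would extract the vector from the smaller model, fit `W` on paired activations, register the hook on the larger model, and compare generations with and without the intervention.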