Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts

📅 2025-01-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates the transferability and universality of concept representations across large language models (LLMs). Method: Inspired by Platonic idealism, we propose a cross-model concept alignment framework that models concept representations in the latent spaces of different LLMs via linear mappings, enabling extraction, alignment, and reuse of steering vectors across models. Contribution/Results: We empirically establish, for the first time, the strong linear alignability of concept representations across LLMs and uncover a "weak-to-strong" transfer principle: steering vectors extracted from smaller models effectively control larger models' behavior. Our method demonstrates significant improvements over baselines in alignment accuracy, cross-model behavioral controllability, and safe, controllable text generation across multiple mainstream LLMs. These findings introduce a novel paradigm for inter-model knowledge transfer, lightweight intervention, and controllable AI.

๐Ÿ“ Abstract
Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
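The linear alignment idea in the abstract can be illustrated with a small synthetic sketch: given steering vectors for the same set of concepts extracted from two models, fit a linear map from one latent space to the other by least squares, then reuse it to transfer a held-out concept's steering vector. All dimensions, data, and variable names below are hypothetical assumptions for illustration, not the authors' actual models or setup.

```python
import numpy as np

# Synthetic stand-in for two LLMs' latent spaces (dimensions are assumptions).
rng = np.random.default_rng(0)
d_small, d_large, n_concepts = 64, 128, 200

# Pretend a true linear relationship exists between the two spaces, and that
# we observed noisy steering vectors (one row per concept) from each model.
ground_truth = rng.normal(size=(d_small, d_large))
sv_small = rng.normal(size=(n_concepts, d_small))
sv_large = sv_small @ ground_truth + 0.01 * rng.normal(size=(n_concepts, d_large))

# Fit the cross-model linear transformation W by ordinary least squares,
# minimizing ||sv_small @ W - sv_large||_F over paired concept vectors.
W, *_ = np.linalg.lstsq(sv_small, sv_large, rcond=None)

# A steering vector for a new concept, extracted only from the small model,
# can now be mapped into the large model's space and added to its activations
# there — the weak-to-strong transfer the paper describes.
new_sv_small = rng.normal(size=(d_small,))
new_sv_large = new_sv_small @ W
```

This is a minimal sketch of the alignment step only; the paper's pipeline additionally covers how steering vectors are extracted from each model and how the mapped vectors are injected to control generation.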
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Concept Representation
Inter-model Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-model Concept Transfer
Linear Transformation
Language Model Interoperability
Youcheng Huang
Sichuan University, Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China
Chen Huang
Sichuan University, Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China
Duanyu Feng
Sichuan University
Machine learning · Numerical optimization · Natural language processing
Wenqiang Lei
Sichuan University, Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China
Jiancheng Lv
University of Science and Technology of China
Operations Management · Marketing