AI Summary
This study investigates the transferability and universality of concept representations across large language models (LLMs).
Method: Inspired by Platonic idealism, we propose a cross-model concept alignment framework that models concept representations in the latent spaces of different LLMs via linear mappings, enabling extraction, alignment, and reuse of steering vectors across models.
Contribution/Results: We empirically establish, for the first time, the strong linear alignability of concept representations across LLMs and uncover a "weak-to-strong" transfer principle: steering vectors extracted from smaller models effectively control larger models' behavior. Our method demonstrates significant improvements over baselines in alignment accuracy, cross-model behavioral controllability, and safe, controllable text generation across multiple mainstream LLMs. These findings introduce a novel paradigm for inter-model knowledge transfer, lightweight intervention, and controllable AI.
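The steering vectors that get transferred are derived from a model's hidden activations. A common recipe, assumed here purely for illustration and not necessarily the paper's exact extraction procedure, is the difference of mean last-token activations between prompts that express a concept and prompts that do not:

```python
# Minimal sketch: extract a concept steering vector as a difference of mean
# hidden activations (an assumed, standard recipe; the paper's own extraction
# method may differ). Layer choice is illustrative.
import torch


def hidden_states(model, tokenizer, prompts, layer):
    """Return the last-token hidden state at `layer` for each prompt."""
    states = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])  # shape: (hidden_dim,)
    return torch.stack(states)


def extract_steering_vector(model, tokenizer, concept_prompts, neutral_prompts, layer):
    """Concept direction = mean(concept activations) - mean(neutral activations)."""
    pos = hidden_states(model, tokenizer, concept_prompts, layer)
    neg = hidden_states(model, tokenizer, neutral_prompts, layer)
    return pos.mean(dim=0) - neg.mean(dim=0)
```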
Abstract
Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
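Finding 1 suggests a concrete pipeline: collect paired hidden states from two models on the same prompts, fit a linear map between their latent spaces, push a source-model steering vector through that map, and add the result to the target model's residual stream during generation. The sketch below is a minimal illustration under assumed choices (a least-squares fit for the linear map and a Llama-style `model.model.layers[layer]` module path); it is not the paper's stated configuration.

```python
# Minimal sketch of cross-model steering-vector transfer via a linear map.
# Assumptions (not from the paper): least-squares fit, Llama-style layer path,
# additive intervention with a scalar strength.
import torch


def fit_linear_map(source_acts, target_acts):
    """Least-squares W such that source_acts @ W ~= target_acts.
    source_acts: (n, d_src) and target_acts: (n, d_tgt), hidden states of the
    two models on the same n prompts."""
    return torch.linalg.lstsq(source_acts, target_acts).solution  # (d_src, d_tgt)


def transfer_steering_vector(sv_source, W):
    """Map a steering vector from the source model's space into the target model's."""
    return sv_source @ W


def add_steering_hook(target_model, layer, steering_vector, scale=1.0):
    """Add the transferred vector to the target layer's output during generation.
    Assumes a Llama-style module layout: target_model.model.layers[layer]."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return target_model.model.layers[layer].register_forward_hook(hook)
```

A weak-to-strong transfer experiment in this style would extract the vector from the smaller model, fit `W` on paired activations, register the hook on the larger model, and compare generations with and without the intervention.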