🤖 AI Summary
Large language models (LLMs) suffer from unpredictability, opacity, and limited controllability. Method: The paper introduces "representation engineering," a paradigm that identifies and edits directions encoding semantic concepts (e.g., honesty, harmfulness) in a model's high-level representation space via contrastive input probing, enabling interpretable, intervention-based behavioral control. Contribution/Results: The authors formally define the paradigm's objectives, scope, and methodology, rigorously distinguishing it from mechanistic interpretability, prompt engineering, and fine-tuning. They propose a unified framework integrating contrastive analysis, concept-level representation editing, high-dimensional causal intervention, and interpretability evaluation. This framework supports controllable, safe, and dynamically adaptive LLM governance, surfaces critical challenges (including performance degradation and controllability collapse), and charts a technical pathway toward predictable, secure, and personalized LLMs.
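To make the method concrete, here is a minimal sketch of the core loop: probe the model with contrastive prompt pairs, take the difference of mean activations as the concept direction, then score or shift new activations along it. The `hidden_state` helper and the prompt templates are hypothetical stand-ins for reading a real LLM's layer activations, and difference-of-means is one common reading technique, not necessarily the exact procedure the paper formalizes.

```python
# Hedged sketch: extracting a concept direction from contrastive prompt pairs.
# `hidden_state` is a hypothetical stand-in for reading one layer's activation
# from a real LLM; here it returns a deterministic pseudo-embedding so the
# sketch runs end to end.
import hashlib
import numpy as np

D = 64  # hidden dimension of the stand-in model


def hidden_state(prompt: str) -> np.ndarray:
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).normal(size=D)


# 1. Contrastive probing: prompt pairs that differ only in the target concept.
honest = [f"Answer truthfully: question {i}" for i in range(32)]
dishonest = [f"Answer deceptively: question {i}" for i in range(32)]
H_pos = np.stack([hidden_state(p) for p in honest])
H_neg = np.stack([hidden_state(p) for p in dishonest])

# 2. Difference of class means as the concept direction (PCA over the pairwise
#    differences is a common alternative).
direction = H_pos.mean(axis=0) - H_neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# 3. Detect: project a new activation onto the direction to score the concept.
score = hidden_state("Answer truthfully: question 99") @ direction

# 4. Edit: shift an activation along the direction at inference time.
alpha = 4.0  # steering strength; too large risks performance degradation
steered = hidden_state("some new prompt") + alpha * direction
print(f"honesty score: {score:.3f}")
```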
📝 Abstract
Large language models are capable of completing a variety of tasks but remain unpredictable and difficult to control. Representation engineering seeks to resolve this problem through a new approach: using samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness, or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline, and we compare it with alternative approaches such as mechanistic interpretability, prompt engineering, and fine-tuning. We outline risks, including performance degradation, increased compute time, and steerability failures. Finally, we present a clear agenda for future research to build predictable, dynamic, safe, and personalizable LLMs.
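For intuition on what "editing" a high-level representation means at inference time, below is a hedged sketch of applying a concept direction as a runtime causal intervention via a PyTorch forward hook. `TinyModel` is a toy stand-in rather than any model from the paper, and the hook mechanism is one standard way to implement activation steering, not the paper's prescribed implementation.

```python
# Hedged sketch: steering a toy model by shifting one layer's activations
# along a concept direction during the forward pass.
import torch
import torch.nn as nn

D = 64  # hidden dimension of the toy model


class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(D, D)

    def forward(self, x):
        return torch.relu(self.linear(x))


class TinyModel(nn.Module):
    def __init__(self, n_layers: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(TinyBlock() for _ in range(n_layers))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


model = TinyModel()
direction = torch.randn(D)  # in practice: the contrastive direction from probing
direction = direction / direction.norm()
alpha = 2.0  # steering strength


def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output,
    # shifting its activations along the concept direction.
    return output + alpha * direction


# Register the intervention on a middle layer; remove() undoes it.
handle = model.blocks[2].register_forward_hook(steer)
with torch.no_grad():
    steered_out = model(torch.randn(1, D))
handle.remove()  # restore the unedited model
```

Removing the hook restores the unedited model, which is what makes this kind of intervention dynamic: steering can be switched on per request rather than baked into the weights by fine-tuning.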