🤖 AI Summary
This study investigates which entities associated with large language models (LLMs), if any, should be identified as minds. It introduces two new views, the (virtual) instance-persona view and the model-persona view, and engages with mechanistic interpretability work on persona vectors, persona space, and emergent misalignment. It first argues for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. It then reviews the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and argues that the two persona-based views are promising alternatives.
📝 Abstract
The individuation problem for large language models asks which entities associated with them, if any, should be identified as minds. We approach this problem through mechanistic interpretability, engaging in particular with recent empirical work on persona vectors, persona space, and emergent misalignment. We argue that three views are the strongest candidates: the virtual instance view and two new views we introduce, the (virtual) instance-persona view and the model-persona view. First, we argue for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. Then we present the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and show that the two persona-based views are promising alternatives.