An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work explains the intrinsic self-correction mechanism of language models in the absence of external feedback: how prompts induce interpretable shifts of hidden states within a linear representation space, aligning output tokens with latent semantic concepts. The authors propose a concept-alignment-based interpretability framework built on linear representations and derive a mathematical model showing that the output distribution is governed by alignment strength. Combining linear-algebraic analysis, interpretable modeling of hidden states, token-level alignment quantification, and text-detoxification experiments on Zephyr-7B-SFT, they find that self-correcting prompts significantly widen the inner-product gap between the prompt-induced offset vector and the unembeddings of high- versus low-toxicity tokens, thereby strengthening latent concept identification. The core contribution is the finding that prompt-induced alignment of linear representations is the fundamental mechanism underlying intrinsic self-correction.

📝 Abstract
We provide an explanation for the performance gains of intrinsic self-correction, a process where a language model iteratively refines its outputs without external feedback. More precisely, we investigate how prompting induces interpretable changes in hidden states and thus affects the output distributions. We hypothesize that each prompt-induced shift lies in a linear span of some linear representation vectors, naturally separating tokens based on individual concept alignment. Building around this idea, we give a mathematical formulation of self-correction and derive a concentration result for output tokens based on alignment magnitudes. Our experiments on text detoxification with zephyr-7b-sft reveal a substantial gap between the inner products of the prompt-induced shifts with the unembeddings of the top-100 most toxic tokens and those with the unembeddings of the bottom-100 least toxic tokens, under toxic instructions. This suggests that self-correction prompts enhance a language model's capability of latent concept recognition. Our analysis offers insight into the underlying mechanism of self-correction by characterizing, in an explainable way, how prompting works. For reproducibility, our code is available.
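The inner-product measurement described in the abstract can be sketched with synthetic data. Everything here is an illustrative assumption, not the paper's actual setup: the hidden states, the unembedding matrix `W_U`, and the toxic/non-toxic token-id sets are toy stand-ins constructed so that the prompt-induced shift is deliberately aligned with a single concept direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 64, 1000

# Stand-ins for the final-layer hidden state without and with a
# self-correction prompt (the paper measures this on zephyr-7b-sft).
h_base = rng.normal(size=d)
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
# Assume the prompt shifts the hidden state along one concept direction.
h_prompt = h_base + 2.0 * concept_dir

shift = h_prompt - h_base  # prompt-induced shift in representation space

# Toy unembedding matrix: "toxic" rows anti-aligned with the concept,
# "non-toxic" rows aligned with it, plus small noise.
W_U = rng.normal(scale=0.1, size=(vocab, d))
toxic_ids = np.arange(100)       # stand-in for the top-100 most toxic tokens
clean_ids = np.arange(100, 200)  # stand-in for the bottom-100 least toxic
W_U[toxic_ids] -= concept_dir
W_U[clean_ids] += concept_dir

# Inner product of the shift with each token's unembedding; the gap
# between the two groups is the quantity the abstract reports.
scores = W_U @ shift
gap = scores[clean_ids].mean() - scores[toxic_ids].mean()
print(f"mean inner-product gap: {gap:.2f}")
```

By construction the gap comes out positive here, i.e. the shift raises the logits of low-toxicity tokens relative to high-toxicity ones, which is the alignment signal the paper measures on real model activations.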
Problem

Research questions and friction points this paper is trying to address.

Explaining intrinsic self-correction in language models
Analyzing prompt-induced changes in hidden states
Investigating latent concept recognition via linear representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear representations explain self-correction performance gains
Prompt-induced shifts lie in linear span of concept vectors
Self-correction enhances latent concept recognition capability
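The second bullet, that prompt-induced shifts lie in a linear span of concept vectors, can be written out as a small formula. The notation is assumed for illustration and is not taken verbatim from the paper:

```latex
% Assumed notation: \Delta h is the prompt-induced shift in the hidden
% state, v_c are linear concept-representation vectors with coefficients
% \alpha_c, and u_t is the unembedding vector of token t.
\[
  \Delta h \;=\; \sum_{c} \alpha_c\, v_c ,
  \qquad
  \Delta \mathrm{logit}(t)
  \;=\; \langle u_t,\, \Delta h \rangle
  \;=\; \sum_{c} \alpha_c\, \langle u_t,\, v_c \rangle .
\]
```

Under this decomposition, tokens separate according to how their unembeddings align with the concept vectors, which is how the prompt can shift probability mass away from tokens tied to an undesired concept.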
Yu-Ting Lee
National Taiwan University
Hui-Ying Shih
National Tsing Hua University
Fu-Chieh Chang
Unknown affiliation
Pei-Yuan Wu
National Taiwan University