On the Convergence of Moral Self-Correction in Large Language Models

📅 2025-10-08
🤖 AI Summary
The mechanisms and efficacy of intrinsic self-correction—where large language models (LLMs) iteratively refine outputs without external feedback, relying solely on internal moral concepts—remain poorly understood, particularly regarding why convergence occurs in moral reasoning. Method: We design multi-turn self-correction experiments grounded in moral reasoning tasks and employ Concept Activation Analysis (CAA) to probe internal representations across iterations. Contribution/Results: We empirically establish that intrinsic self-correction exhibits convergence: model performance stabilizes after 3–5 refinement rounds. CAA reveals that sustained moral instructions consistently activate domain-specific internal representations, progressively reducing output uncertainty. This process significantly improves both the moral appropriateness and logical consistency of responses. To our knowledge, this is the first systematic demonstration of convergence in intrinsic self-correction and its underlying neural representation dynamics. Our findings offer a novel, annotation-free pathway for enhancing LLM trustworthiness and optimizing model behavior through internal alignment mechanisms.
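The multi-turn setup described above can be sketched as a simple loop. This is a minimal, hypothetical illustration, not the authors' code: `query_model` is a stand-in for an actual LLM call (stubbed here so the structure is runnable), and the prompt wording is assumed, not taken from the paper.

```python
def query_model(prompt: str) -> str:
    # Stub standing in for a real LLM API call.
    return f"response to: {prompt[:40]}"

def intrinsic_self_correction(question: str, instruction: str, rounds: int = 5) -> list[str]:
    """Re-prompt the model with the same abstract moral instruction each
    round, with no external feedback on prior answers; the model must rely
    on internal knowledge to refine its response."""
    answer = query_model(f"{instruction}\n\nQ: {question}")
    answers = [answer]
    for _ in range(rounds - 1):
        # Each round re-injects the instruction together with the previous
        # answer and asks the model to review and improve it.
        prompt = (f"{instruction}\n\nQ: {question}\n"
                  f"Your previous answer: {answer}\n"
                  "Please review and improve your answer.")
        answer = query_model(prompt)
        answers.append(answer)
    return answers
```

Per the summary's finding, one would expect the answers in such a trajectory to stabilize after roughly 3 to 5 rounds.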

📝 Abstract
Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general, abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it works remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction, performance convergence through multi-round interactions, and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, and performance converges as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property: converged performance.
Problem

Research questions and friction points this paper is trying to address.

Investigating convergence mechanisms in moral self-correction
Analyzing how intrinsic self-correction reduces model uncertainty
Revealing how moral concept activation stabilizes performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intrinsic self-correction uses internal knowledge for improvement
Multi-round interactions enable performance convergence in models
Consistent instructions activate moral concepts reducing uncertainty
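The two quantities behind the mechanism above, concept activation and model uncertainty, can be sketched with standard measures. This is an illustrative assumption about the analysis, not the paper's actual implementation: it assumes access to per-layer hidden states and next-token probabilities, and uses cosine similarity to a concept direction and Shannon entropy as simple proxies.

```python
import numpy as np

def concept_activation(hidden_state: np.ndarray, concept_vector: np.ndarray) -> float:
    """Cosine similarity between a layer's hidden state and a moral-concept
    direction, in the spirit of concept activation analysis."""
    h = hidden_state / np.linalg.norm(hidden_state)
    c = concept_vector / np.linalg.norm(concept_vector)
    return float(h @ c)

def token_entropy(probs: np.ndarray) -> float:
    """Shannon entropy of the next-token distribution, a simple proxy for
    model uncertainty; lower entropy means a more confident model."""
    p = probs[probs > 0]  # drop zero-probability entries to avoid log(0)
    return float(-(p * np.log(p)).sum())
```

Tracking these two values across self-correction rounds would, under the paper's account, show concept activation stabilizing while entropy decreases.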