Discourse Heuristics For Paradoxically Moral Self-Correction

📅 2025-07-01

🤖 AI Summary

This work identifies a dual paradox in the moral self-correction of large language models (LLMs): corrections remain superficial, and models that can diagnose moral inconsistencies in their output still struggle to attribute them to a precise cause. To address this, we propose a heuristic modeling paradigm informed by discourse structure, constructing a fine-grained instruction-tuning dataset that systematically characterizes how discourse heuristics, such as agent responsibility attribution and consequence emphasis, modulate self-correction capability. Multi-scale generalization evaluation reveals that such heuristics significantly enhance moral correction in smaller models, yet their efficacy rapidly diminishes in larger models and complex scenarios, exposing an intrinsic tension between self-diagnosis and self-correction capabilities. This study pioneers the integration of rigorous discourse analysis into LLM moral alignment research, offering a novel structural framework for diagnosing inherent limitations in model moral reasoning.

📝 Abstract

Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence supporting the effectiveness of self-correction, this LLM capability operates only at a superficial level. Second, while LLMs are capable of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the heuristics that underlie effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly with respect to learning from situated context and across model scales.

Problem

Research questions and friction points this paper is trying to address.

LLMs' moral self-correction operates only superficially, despite empirical and theoretical evidence of its effectiveness

LLMs struggle to identify causes of moral inconsistency

Heuristic shortcuts in discourse constructions lead to correction inconsistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing discourse constructions in fine-tuning corpora

Leveraging heuristics of curated datasets for improvement

Addressing generalization challenges in moral self-correction

🔎 Similar Papers

A Survey on Moral Foundation Theory and Pre-Trained Language Models: Current Advances and Challenges