LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

📅 2026-05-19
🤖 AI Summary
This work targets two obstacles to self-evolving reasoning in large language models: high-quality process data is scarce, and existing intrinsic reward mechanisms suffer from imitation bias, coarse-grained supervision, and distributional collapse. Framing self-alignment as a latent structure discovery problem, the authors propose an intrinsic reward decomposition framework built on logical consistency. It combines a variational logic potential with a multi-agent value decomposition protocol grounded in the Individual-Global-Max (IGM) principle, so that fine-grained, noise-robust step-level supervision signals can be extracted from model-generated trajectories without external annotations. The authors present the method as a way to denoise reasoning logic and to identify high-value reasoning paths that standard rewards miss. According to the abstract, experiments show that it provides a robust self-evolution path and reveal a trade-off between logical consistency and task accuracy.
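For readers unfamiliar with the IGM principle, the condition below is how it is usually stated in the cooperative multi-agent reinforcement learning literature. The symbols (joint value Q_tot, per-agent values Q_i, histories tau_i, actions u_i) are the conventional ones from that literature and are not taken from this paper; how LC-ERD maps reasoning steps onto agents is not specified in this summary.

```latex
% IGM: the greedy joint action under Q_tot coincides with the tuple of
% per-agent greedy actions under the individual value functions Q_i.
\arg\max_{\mathbf{u}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u})
=
\Bigl(
  \arg\max_{u_1} Q_1(\tau_1, u_1),\; \dots,\;
  \arg\max_{u_n} Q_n(\tau_n, u_n)
\Bigr)

% One well-known sufficient condition is an additive decomposition:
Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{i=1}^{n} Q_i(\tau_i, u_i)
```

Under the additive form, raising any individual Q_i can never lower Q_tot, which is why per-agent greedy choices remain jointly greedy.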
📝 Abstract
The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/Reinhardmannn/LC-ERD.
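The coarse-grained supervision problem in point (2) can be made concrete with a small sketch. The first function computes group-relative advantages in the style of GRPO, where every step of a trajectory inherits the same scalar. The second function is purely hypothetical: it splits that scalar across steps in proportion to softmax-normalised per-step scores, so the step credits still sum to the trajectory value. The names `decompose_by_step`, `step_scores`, and `temperature` are assumptions made for illustration, and this is not the decomposition protocol used by LC-ERD.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardise outcome rewards within one sampled group.

    Returns one scalar per trajectory; every step of a trajectory would
    share that scalar, which is the coarse-grained signal described above.
    Population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decompose_by_step(trajectory_value, step_scores, temperature=1.0):
    """Hypothetical additive split of one trajectory-level value.

    step_scores stands in for any per-step quality estimate (for example a
    consistency score). The returned credits sum to trajectory_value.
    """
    if not step_scores:
        raise ValueError("step_scores must contain at least one step")
    top = max(step_scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in step_scores]
    total = sum(weights)
    return [trajectory_value * w / total for w in weights]


if __name__ == "__main__":
    # Four sampled answers to one prompt: two correct (1.0), two wrong (0.0).
    advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
    # Made-up scores for a three-step chain belonging to the first answer.
    credits = decompose_by_step(advantages[0], [2.0, 0.5, 2.0])
    print(advantages)
    print(credits)
```

Because the split is additive, the step-level credits can replace the single trajectory-level advantage without changing its total, while steps with higher scores receive a larger share.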
Problem

Research questions and friction points this paper is trying to address.

Label Noise
Coarse-Grained Supervision
Distributional Collapse
Self-Alignment
Reasoning Evolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Logic Expertise
Endogenous Reward Decomposition
Variational Logic Potential
Multi-Agent Value Decomposition
Self-Evolving Reasoning