Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

📅 2026-05-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Self-evolving large language models can degrade during unsupervised training because errors in their self-generated feedback turn into erroneous gradient updates. To address this without external verifiers, the authors propose COSE, a framework that uses the model's intrinsic confidence as a lightweight uncertainty signal. COSE combines confidence-weighted proximal policy optimization (PPO) updates with confidence-prioritized replay to limit the influence of unreliable self-judgments. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B to 4B), COSE consistently improves over the base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code.
πŸ“ Abstract
Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B–4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
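The abstract names two mechanisms, confidence-weighted PPO updates and confidence-prioritized replay, without giving their exact form. The sketch below is one plausible reading, not the authors' implementation. It assumes that confidence is the geometric-mean token probability of a generated judgment, that confidence multiplies each sample's clipped PPO surrogate, and that replay samples are drawn in proportion to confidence. The function names, the `temperature` parameter, and the `"confidence"` dictionary key are all hypothetical.

```python
import math
import random


def sequence_confidence(token_logprobs):
    """Assumed confidence measure: geometric-mean token probability, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_weighted_ppo_loss(logp_new, logp_old, advantages, confidences,
                                 clip_eps=0.2):
    """Clipped PPO surrogate loss where each sample is scaled by its confidence.

    Assumption: low-confidence self-judgments contribute proportionally
    smaller updates. The paper may use a different weighting scheme.
    """
    total = 0.0
    for ln, lo, adv, conf in zip(logp_new, logp_old, advantages, confidences):
        ratio = math.exp(ln - lo)  # pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)  # standard PPO-clip term
        total += conf * surrogate
    # Negate because optimizers minimize; average over the batch size.
    return -total / len(advantages)


def sample_replay(buffer, k, rng, temperature=1.0):
    """Draw k items (with replacement) with probability proportional to
    confidence ** (1 / temperature). Hypothetical prioritization rule."""
    weights = [item["confidence"] ** (1.0 / temperature) for item in buffer]
    return rng.choices(buffer, weights=weights, k=k)


if __name__ == "__main__":
    rng = random.Random(0)
    buffer = [
        {"task": "a", "confidence": 0.9},
        {"task": "b", "confidence": 0.2},
    ]
    batch = sample_replay(buffer, k=4, rng=rng)
    print([item["task"] for item in batch])
```

Under these assumptions, a sample whose self-judgment has zero confidence contributes nothing to the loss and is never replayed, which is the noise-suppression behaviour the abstract describes at a high level.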
Problem

Research questions and friction points this paper is trying to address.

self-evolving LLMs
training-signal challenge
uncertain feedback
self-judgment errors
reasoning domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolution
confidence-based learning
uncertainty-aware training
PPO
large language models