Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training large language models (LLMs) on non-verifiable tasks—such as creative writing and ethical reasoning—where ground-truth labels are unavailable, and existing LLM-as-Judge approaches are hindered by evaluator bias and limited capability. To overcome these limitations, the paper introduces CoNL, a novel framework that integrates meta-evaluation into multi-agent self-play. In CoNL, multiple agents sharing a common policy engage in structured dialogue to generate responses, critique each other, and iteratively revise their outputs. Critically, the framework uses “whether criticism facilitates improvement in others” as a diagnostic reward signal, enabling the joint co-evolution of generation and evaluation capabilities without reliance on external annotations or human feedback. Experiments demonstrate that CoNL significantly outperforms self-rewarding baselines across five benchmarks while maintaining stable training dynamics.
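The propose→critique→revise loop with a diagnostic reward described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`self_play_round`, `diagnostic_reward`), the binary reward scale, and the toy scoring function are all assumptions introduced for clarity.

```python
# Hypothetical sketch of CoNL-style self-play: an agent proposes a draft,
# a peer (sharing the same policy) critiques it, the proposer revises, and
# the critic is rewarded only if its critique actually led to improvement.
# All names and the 0/1 reward scale are illustrative assumptions.

def diagnostic_reward(score_before: float, score_after: float) -> float:
    """Reward the critic iff the revision improves on the original draft."""
    return 1.0 if score_after > score_before else 0.0

def self_play_round(draft, critique_fn, revise_fn, score_fn):
    """One propose -> critique -> revise exchange between agents."""
    critique = critique_fn(draft)              # peer evaluates the draft
    revision = revise_fn(draft, critique)      # proposer revises using it
    reward = diagnostic_reward(score_fn(draft), score_fn(revision))
    return revision, reward

# Toy usage with stand-in agents (a real system would call the shared LLM
# policy for critique_fn/revise_fn, and score via comparison, not length):
revision, reward = self_play_round(
    draft="A short answer.",
    critique_fn=lambda d: "Add supporting detail.",
    revise_fn=lambda d, c: d + " Here is supporting detail.",
    score_fn=len,
)
```

The key point the sketch captures is that the reward supervises the *critique*, not the revision directly: a critique earns credit only when it demonstrably helps another agent improve, which is the paper's stated signal for training evaluation capability without ground-truth labels.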

📝 Abstract
Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
Problem

Research questions and friction points this paper is trying to address.

non-verifiable learning
LLM-as-Judge
evaluation bias
meta-evaluation
ground-truth labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

meta-evaluation
self-play
non-verifiable learning
diagnostic reward
multi-agent conversation