Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning, AI-generated preference feedback often suffers from inconsistent judgments—such as preference cycles—leading to unstable training. To address this, we propose the first systematic framework for detecting and eliminating preference conflicts. We introduce the Conflict Detection Rate (CDR) as a quantitative metric for inconsistency and design the Deconflicted Graph Reward (DGR) mechanism: leveraging graph algorithms, DGR transforms the original preference graph into a directed acyclic graph (DAG), thereby generating logically consistent reward signals and purifying rewards prior to policy optimization. Experiments demonstrate that our approach significantly improves training stability and final performance, outperforming strong baselines across multiple benchmarks. This work provides the first empirical validation that logical consistency is a critical determinant of AI feedback quality.
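The paper does not give the exact formula for the Conflict Detection Rate, but the idea of quantifying preference cycles can be sketched as follows. In this hypothetical version, CDR is the fraction of preference judgments whose two endpoints lie inside the same non-trivial strongly connected component of the preference graph (i.e., edges that participate in a cycle); the function name and definition are assumptions for illustration.

```python
from collections import defaultdict

def conflict_detection_rate(preferences):
    """Hypothetical CDR: fraction of preference edges (winner, loser)
    that lie on a directed cycle, i.e., whose endpoints share a
    strongly connected component of the preference graph."""
    graph = defaultdict(list)
    nodes = set()
    for winner, loser in preferences:
        graph[winner].append(loser)
        nodes.update((winner, loser))

    # Tarjan's algorithm: find strongly connected components (SCCs).
    index, lowlink, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in list(nodes):
        if v not in index:
            strongconnect(v)

    comp_of = {v: i for i, comp in enumerate(sccs) for v in comp}
    conflicted = sum(1 for w, l in preferences if comp_of[w] == comp_of[l])
    return conflicted / len(preferences) if preferences else 0.0
```

For example, the cyclic judgments A>B, B>C, C>A plus the acyclic A>D would give a rate of 3/4 under this definition, since three of the four edges sit on the cycle.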

📝 Abstract
AI-generated preference feedback, however, often exhibits judgment inconsistencies that can destabilize reinforcement learning. While prior research has focused on the accuracy of judgments, the critical issue of logical coherence, especially preference cycles, has not been fully addressed. To fill this gap, we introduce a comprehensive framework designed to systematically detect and resolve these inconsistencies during the reinforcement learning training process. Our framework makes two main contributions: first, the Conflict Detection Rate (CDR), a new metric that quantifies judgment conflicts; and second, Deconflicted Graph Rewards (DGR), a framework that purifies reward signals by removing cycles before policy optimization. DGR constructs preference graphs from the initial judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal that is compatible with any policy optimizer. Experimental results show that our framework significantly enhances training stability and model performance compared to strong baselines, establishing logical consistency as a crucial and now manageable dimension of AI feedback.
Problem

Research questions and friction points this paper is trying to address.

Addresses judgment inconsistencies in AI feedback systems
Resolves logical coherence issues like preference cycles
Stabilizes reinforcement learning through conflict detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Conflict Detection Rate metric for judgment conflicts
Develops Deconflicted Graph Rewards framework removing preference cycles
Transforms preference graphs into conflict-free Directed Acyclic Graphs
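The graph-to-DAG transformation described above can be sketched in a minimal form. The paper does not specify which cycle-breaking rule DGR uses, so this sketch greedily keeps each preference edge only if it would not close a directed cycle with edges kept so far, then derives a simple win-count reward from the resulting DAG; both the edge-priority order and the reward rule are illustrative assumptions.

```python
from collections import defaultdict

def deconflict_to_dag(preferences):
    """Hypothetical DGR-style deconfliction: keep a (winner, loser)
    edge only if the loser cannot already reach the winner through
    kept edges (which would close a cycle). Returns surviving edges."""
    graph = defaultdict(set)

    def reachable(src, dst):
        seen, frontier = set(), [src]
        while frontier:
            v = frontier.pop()
            if v == dst:
                return True
            if v in seen:
                continue
            seen.add(v)
            frontier.extend(graph[v])
        return False

    kept = []
    for winner, loser in preferences:
        if not reachable(loser, winner):  # edge would not create a cycle
            graph[winner].add(loser)
            kept.append((winner, loser))
    return kept

def dag_rewards(edges):
    """Illustrative reward: each response scores its number of wins
    in the conflict-free graph."""
    wins = defaultdict(int)
    nodes = set()
    for w, l in edges:
        wins[w] += 1
        nodes.update((w, l))
    return {v: wins[v] for v in nodes}
```

On the cycle A>B, B>C, C>A, this sketch drops the last edge (A already reaches C), leaving a DAG where A and B each score one win and C scores none; a first-come priority like this is only one possible tie-breaking choice.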