Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the training instability of reinforcement learning (RL) for honesty alignment in language models, in particular the collapse observed during multi-step deductive reasoning, where early negative rewards dominate policy updates. To mitigate this, the authors propose Anchor, a method that stabilizes RL by anchoring policy updates to ground-truth trajectories, combined with curriculum learning over graph-structured reasoning datasets of controllable difficulty (one for linear algebra, one for logical inference). Additionally, perturbation-based generation of unanswerable samples strengthens the model's "refusal" capability. Technically, Anchor builds on verifiable-reward RL (RLVR), GRPO, and supervised fine-tuning. Experiments show that Anchor significantly improves training stability and consistently enhances both reasoning accuracy and honesty judgment across three distinct model families, outperforming the RLVR baseline on all evaluated dimensions.
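The anchoring idea described above, keeping at least one ground-truth trajectory in each rollout group so that rewards are never uniformly negative, can be sketched as follows. This is a minimal illustration of the mechanism, not the authors' implementation; the helper names (`sample_rollout`, `reward_fn`) and the group-relative advantage computation are assumptions based on how GRPO is typically described.

```python
import random

def build_rollout_group(prompt, ground_truth, sample_rollout, reward_fn, group_size=8):
    """Sample a GRPO-style rollout group, then anchor it by replacing one
    sampled trajectory with the ground-truth trajectory (illustrative sketch)."""
    rollouts = [sample_rollout(prompt) for _ in range(group_size)]
    # Anchor: overwrite one slot so at least one rollout earns positive reward,
    # avoiding the all-negative groups that drive early-training collapse.
    rollouts[random.randrange(group_size)] = ground_truth
    rewards = [reward_fn(prompt, r) for r in rollouts]
    # GRPO-style group-relative advantage: reward minus the group mean.
    mean_r = sum(rewards) / len(rewards)
    advantages = [r - mean_r for r in rewards]
    return rollouts, advantages
```

With a degenerate policy that only produces wrong answers, the anchored group still yields one positive advantage, so the policy update has a useful gradient direction instead of uniformly penalizing every trajectory.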

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives. However, most existing methods optimize only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. This challenge is especially pronounced in honesty alignment, where models must not only solve answerable queries but also identify when conclusions cannot be drawn from the given premises. Deductive reasoning provides an ideal testbed because it isolates reasoning capability from reliance on external factual knowledge. To investigate honesty alignment, we curate two multi-step deductive reasoning datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that GRPO, with or without supervised fine-tuning initialization, struggles on these tasks. Through extensive experiments across three models, we evaluate stabilization strategies and show that curriculum learning provides some benefit but requires carefully designed in-distribution datasets with controllable difficulty. To address these limitations, we propose Anchor, a reinforcement learning method that injects ground-truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling reliable deductive reasoning in aligned language models.
Problem

Research questions and friction points this paper is trying to address.

RL training for honesty alignment in language models is unstable
Training collapses when negative rewards dominate early stages
Models must reason reliably over both answerable and unanswerable queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchor method injects ground-truth trajectories into rollouts
Curriculum learning over graph-structured datasets with controllable difficulty
Stabilized RL training that improves both reasoning accuracy and honesty judgment
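The controllable-difficulty curriculum mentioned above can be sketched by ordering instances by an assumed difficulty proxy, here the number of deduction steps, and training in stages. The `steps` field and the staging scheme are illustrative assumptions, not the paper's exact design.

```python
def curriculum_schedule(instances, num_stages=3):
    """Split instances into stages of increasing difficulty, using the
    number of deduction steps as an assumed difficulty proxy."""
    ranked = sorted(instances, key=lambda inst: inst["steps"])
    stage_size = -(-len(ranked) // num_stages)  # ceiling division
    return [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]
```

A graph-structured dataset makes this proxy easy to control: the derivation length is simply the path length used to generate the instance.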