Teaching Large Reasoning Models Effective Reflection

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that large reasoning models (LRMs) often produce invalid or superficial feedback during self-reflection, which fails to improve answer quality while incurring substantial computational overhead. To overcome this limitation, the authors propose Self-Critique Fine-Tuning (SCFT), a framework that relies solely on model-generated critiques: the model critiques its own outputs, high-quality critiques are filtered through rejection sampling, and the model is fine-tuned with a critique-based objective. Building on this foundation, they introduce Reinforcement Learning with Effective Reflection Rewards (RLERR), which, they claim for the first time, incorporates high-quality self-reflection as a reward signal in reinforcement learning, explicitly optimizing and internalizing an efficient self-correction mechanism. Evaluated on the AIME2024 and AIME2025 benchmarks, the method significantly improves both reasoning accuracy and the quality of self-reflection, outperforming current state-of-the-art baselines.
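The rejection-sampling step described above can be illustrated with a toy sketch. This is an assumption-laden stand-in, not the authors' code: `toy_answer` and `toy_critique` are hypothetical placeholders for real LLM calls, and a ground-truth check plays the role of the critique filter, keeping only self-critiques whose revised answer verifies as correct.

```python
# Hedged sketch of SCFT-style critique filtering (assumed pipeline, toy stand-ins).

def toy_answer(question):
    # Hypothetical "model": adds two numbers, but is off by one
    # whenever the true sum is a multiple of 3.
    a, b = question
    return a + b + (1 if (a + b) % 3 == 0 else 0)

def toy_critique(question, answer):
    # Hypothetical self-critique: re-derives the answer, but the
    # critique itself can be wrong (here, whenever a == 3).
    a, b = question
    revised = a + b if a != 3 else a + b + 2
    verdict = "correct" if revised == answer else "incorrect"
    return {"verdict": verdict, "revised_answer": revised}

def collect_scft_data(questions):
    # Rejection sampling: keep only (question, answer, critique) triples
    # whose revised answer passes the ground-truth check.
    kept = []
    for q in questions:
        ans = toy_answer(q)
        crit = toy_critique(q, ans)
        if crit["revised_answer"] == sum(q):  # verifier accepts the critique
            kept.append((q, ans, crit))
    return kept

data = collect_scft_data([(1, 2), (2, 4), (3, 5)])
```

Here the third question is dropped because its critique's revision is itself wrong; only verified critiques survive to serve as fine-tuning data.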

📝 Abstract
Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks, often by engaging in self-reflective behaviors such as self-critique and backtracking. However, not all reflections are beneficial: many are superficial, offering little to no improvement over the original answer while incurring computational overhead. In this paper, we identify and address the problem of superficial reflection in LRMs. We first propose Self-Critique Fine-Tuning (SCFT), a training framework that enhances the model's reflective reasoning ability using only self-generated critiques. SCFT prompts models to critique their own outputs, filters high-quality critiques through rejection sampling, and fine-tunes the model using a critique-based objective. Building on this strong foundation, we further introduce Reinforcement Learning with Effective Reflection Rewards (RLERR). RLERR leverages the high-quality reflections initialized by SCFT to construct reward signals, guiding the model to internalize the self-correction process via reinforcement learning. Experiments on two challenging benchmarks, AIME2024 and AIME2025, show that SCFT and RLERR significantly improve both reasoning accuracy and reflection quality, outperforming state-of-the-art baselines. All data and code are available at https://github.com/wanghanbinpanda/SCFT.
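A reflection-aware reward in the spirit of RLERR can be sketched as follows. The shaping terms (bonus and penalty values) are assumptions for illustration, not the paper's formula: the idea is simply that reflection which turns a wrong answer into a right one earns extra reward, while reflection that changes nothing is lightly penalized for its cost.

```python
# Hedged sketch of an RLERR-style reward (assumed shaping, not the paper's exact form).

def reflection_reward(initial_correct, final_correct, reflected):
    """Reward final correctness, add a bonus when reflection actually
    fixed a wrong answer, and penalize superficial reflection that
    spent tokens without changing correctness."""
    reward = 1.0 if final_correct else 0.0
    if reflected:
        if final_correct and not initial_correct:
            reward += 0.5   # effective reflection: the critique fixed the answer
        elif final_correct == initial_correct:
            reward -= 0.2   # superficial reflection: no improvement, extra cost
    return reward
```

Under this shaping, the highest-reward behavior is reflecting only when it corrects an error, which is the self-correction pattern RLERR aims to internalize.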
Problem

Research questions and friction points this paper is trying to address.

Large Reasoning Models
superficial reflection
self-critique
reflective reasoning
reasoning accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Critique Fine-Tuning
Reinforcement Learning
Effective Reflection
Large Reasoning Models
Reflection Quality