Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Supervised learning (SL) for mathematical reasoning suffers from its reliance on ground-truth answers and its inability to reflect on errors. Method: We propose Negative-aware Fine-Tuning (NFT), a novel SL paradigm that requires neither external teachers nor a reinforcement learning (RL) framework. During online training, NFT collects self-generated erroneous samples using binary verification signals, implicitly models a negative policy that shares parameters with the positive model, and performs joint positive-negative supervised fine-tuning, enabling autonomous error identification and strategy refinement entirely within the SL framework. Contribution/Results: We theoretically prove that NFT is equivalent to GRPO under strict on-policy conditions, bridging the methodological gap between SL and RL in binary-feedback learning. Experiments on 7B and 32B models show that NFT significantly outperforms SL baselines such as rejection-sampling fine-tuning, matching or exceeding state-of-the-art RL methods including GRPO and DAPO.
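The joint positive-negative training described above can be illustrated with a toy sketch. Everything here is a hypothetical simplification, not the paper's exact objective: it assumes the sampling policy decomposes as a mixture pi_old = r * pi_pos + (1 - r) * pi_neg, where r is the verified-correct rate for a prompt, so that a negative policy is implied rather than separately parameterized. The function name `nft_losses` and the categorical three-answer setup are illustrative inventions.

```python
# Toy sketch: joint positive/negative supervised losses over a small
# categorical "policy" (answer -> probability). Hypothetical
# simplification of NFT-style training: the negative policy is implicit,
# derived from the frozen sampling policy and the current model via the
# assumed mixture pi_old = r * pi_theta + (1 - r) * pi_neg.
import math

def nft_losses(pi_theta, pi_old, r, pos_answers, neg_answers):
    """Return (positive NLL, negative NLL under the implicit policy).

    pi_theta: current policy probs being optimized (dict: answer -> prob)
    pi_old:   frozen sampling-policy probs (dict: answer -> prob)
    r:        fraction of sampled answers verified correct (0 < r < 1)
    """
    # Positive branch: ordinary supervised fine-tuning on verified-correct
    # answers (standard negative log-likelihood).
    pos_loss = -sum(math.log(pi_theta[a]) for a in pos_answers)
    pos_loss /= max(len(pos_answers), 1)

    # Negative branch: likelihood of wrong answers under the *implicit*
    # negative policy pi_neg(a) = (pi_old(a) - r * pi_theta(a)) / (1 - r),
    # floored at eps so the log stays defined.
    eps = 1e-8
    neg_loss = 0.0
    for a in neg_answers:
        p_neg = max((pi_old[a] - r * pi_theta[a]) / (1.0 - r), eps)
        neg_loss -= math.log(p_neg)
    neg_loss /= max(len(neg_answers), 1)
    return pos_loss, neg_loss

# Three candidate answers; "A" was verified correct, "B" and "C" were not.
pi_old = {"A": 0.5, "B": 0.3, "C": 0.2}   # frozen sampler
pi_theta = dict(pi_old)                    # shared parameters: same model
pos_loss, neg_loss = nft_losses(pi_theta, pi_old, r=0.5,
                                pos_answers=["A"], neg_answers=["B", "C"])
```

Because the implicit negative policy is written in terms of pi_theta itself, minimizing the negative branch pushes probability mass away from verified-wrong answers using only the positive model's parameters, which is the sense in which the summary says the negative policy is modeled "via shared parameters".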

📝 Abstract
Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs' generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that through the additional leverage of negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
Problem

Research questions and friction points this paper is trying to address.

Bridging supervised and reinforcement learning in math reasoning
Enabling LLMs to reflect on failures without external teachers
Improving model performance using negative feedback in training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Negative-aware Fine-Tuning enables autonomous self-improvement
Implicit negative policy models self-generated failures
Direct policy optimization on all LLM generations