InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

📅 2026-01-20
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the credit assignment problem in large language models (LLMs) trained via reinforcement learning, where sparse, outcome-based rewards often suppress correct intermediate reasoning steps. To mitigate this, the authors propose Intervention Training (InT), a novel paradigm that introduces a self-initiated, single-step intervention mechanism: during training, the model identifies the first erroneous step in its reasoning trajectory and generates a precise correction guided by reference answers and verification signals. InT combines supervised fine-tuning with subsequent reinforcement learning to enable fine-grained credit assignment. Evaluated on the IMO-AnswerBench benchmark, InT improves the accuracy of a 4B-parameter model by nearly 14%, surpassing larger open-source models such as gpt-oss-20b.

📝 Abstract
Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing the error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
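The data-construction loop described in the abstract (find the first erroneous step, propose a single-step correction, and build an SFT target from the on-policy prefix plus the intervention) can be sketched as follows. This is a minimal illustration, not the paper's released code: `verify`, `propose_intervention`, and the toy arithmetic verifier are hypothetical stand-ins for the reference-answer-guided verification and correction the authors describe.

```python
# Hypothetical sketch of InT-style SFT data construction.
# All helper names are illustrative assumptions, not the paper's API.

def first_error_index(steps, verify):
    """Return the index of the first step the verifier rejects, or None."""
    for i, step in enumerate(steps):
        if not verify(step):
            return i
    return None

def build_int_example(rollout_steps, verify, propose_intervention):
    """Build one SFT example: the on-policy rollout up to the first error,
    concatenated with a single corrective step (the 'intervention')."""
    err = first_error_index(rollout_steps, verify)
    if err is None:
        return None  # rollout already correct; nothing to correct
    prefix = rollout_steps[:err]
    intervention = propose_intervention(rollout_steps[err])
    # SFT target = prefix + intervention, localizing credit to the failing step.
    return {"prompt_steps": prefix, "target_step": intervention}

# Toy usage: steps are arithmetic lines "a+b=c"; the verifier checks them,
# and the intervention rewrites the failing line with the correct sum.
def toy_verify(step):
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return int(a) + int(b) == int(rhs)

def toy_intervene(bad_step):
    lhs, _ = bad_step.split("=")
    a, b = lhs.split("+")
    return f"{lhs}={int(a) + int(b)}"

rollout = ["1+1=2", "2+3=6", "6+4=10"]  # second step is wrong
example = build_int_example(rollout, toy_verify, toy_intervene)
print(example)  # {'prompt_steps': ['1+1=2'], 'target_step': '2+3=5'}
```

In the paper's setting, the verification and intervention steps are performed by the model itself against reference solutions, and the resulting examples initialize a subsequent round of outcome-reward RL.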
Problem

Research questions and friction points this paper is trying to address.

credit assignment
reinforcement learning
large language models
reasoning
outcome-reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intervention Training
credit assignment
reasoning trace correction
supervised fine-tuning
outcome-reward RL
Matthew Y. R. Yang
Carnegie Mellon University
Hao Bai
University of Illinois Urbana-Champaign
Ian Wu
Carnegie Mellon University
Gene Yang
Carnegie Mellon University
Amrith Rajagopal Setlur
Carnegie Mellon University
Aviral Kumar
Carnegie Mellon University