Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Relying solely on final-answer correctness as a reward can reinforce unreliable reasoning and limit downstream utility. This work proposes TraceLift, a planner-executor training framework built on an “executor-grounded reward” that treats reasoning traces as intermediate artifacts consumed by a downstream executor. The reward multiplies a rubric-based Reasoning Reward Model (RM) score by the measured uplift on a frozen executor, so a trace earns credit only when it is both high-quality and useful. To make reasoning quality directly learnable, the work introduces TRACELIFT-GROUPS, a rubric-annotated dataset in which each same-problem group pairs a high-quality reference trace with multiple plausible flawed traces produced by localized perturbations. Experiments on math and code benchmarks show that the approach improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should assess both whether a trace looks good and whether it helps the model that consumes it.

📝 Abstract

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it.
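The multiplicative reward described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `TraceEvaluation`, `executor_uplift`, and `executor_grounded_reward` are hypothetical, and the sketch assumes that the RM score lies in [0, 1], that uplift is the frozen executor's verifier pass rate with the trace minus its pass rate without it, and that negative uplift is clipped to zero. The abstract specifies none of these details.

```python
from dataclasses import dataclass


@dataclass
class TraceEvaluation:
    """Hypothetical container for the signals collected for one planner trace."""
    rm_score: float            # rubric-based RM score, assumed to lie in [0, 1]
    pass_with_trace: float     # frozen executor's verifier pass rate given the trace
    pass_without_trace: float  # same executor's pass rate with no trace (baseline)


def executor_uplift(ev: TraceEvaluation) -> float:
    """Measured uplift on the frozen executor.

    Assumption: uplift is a pass-rate difference, clipped at zero so that
    harmful traces receive no reward rather than a negative one.
    """
    return max(0.0, ev.pass_with_trace - ev.pass_without_trace)


def executor_grounded_reward(ev: TraceEvaluation) -> float:
    """Multiply trace quality by usefulness, as the abstract describes."""
    return ev.rm_score * executor_uplift(ev)


# Three illustrative traces with made-up numbers.
polished_but_useless = TraceEvaluation(rm_score=0.9, pass_with_trace=0.5, pass_without_trace=0.5)
useful_but_sloppy = TraceEvaluation(rm_score=0.25, pass_with_trace=0.75, pass_without_trace=0.25)
good_and_useful = TraceEvaluation(rm_score=0.9, pass_with_trace=0.75, pass_without_trace=0.25)

for name, ev in [
    ("polished_but_useless", polished_but_useless),
    ("useful_but_sloppy", useful_but_sloppy),
    ("good_and_useful", good_and_useful),
]:
    print(name, executor_grounded_reward(ev))
```

Because the two factors are multiplied rather than added, a trace that scores well on the rubric but does not help the executor earns nothing, and a helpful but low-quality trace earns little. Only a trace that is strong on both counts receives a high reward.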

Problem

Research questions and friction points this paper is trying to address.

reasoning trace

reward design

planner-executor framework

faithfulness

intermediate reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

executor-grounded reward

reasoning trace

planner-executor framework