AI Summary
This work addresses code repair under weak feedback, where rollout-time signals are executable and reliable but do not capture the target semantic objective, and where existing reinforcement learning methods often fall short. To strengthen the Group Relative Policy Optimization (GRPO) algorithm, the authors propose a three-part signal-reshaping mechanism: a layered compile-and-semantic reward to recover semantic ranking, step-level process scores for intra-trajectory credit assignment, and failure-cause-aware rollout governance to keep rollouts within a group comparable, all while preserving GRPO's group-normalization structure. Experiments show that the full method raises strict compile-and-semantic accuracy from the base model's zero-shot 0.385 to 0.535. On top of layered rewards, process-score weighting improves accuracy from 0.48 to 0.53 and reduces average evaluation steps from 23.50 to 17.02, outperforming binary-reward and token-level distillation baselines.
Abstract
Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO's within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intra-trajectory credit, and rollouts from the same prompt remain execution-comparable. We operationalize these conditions with a minimal signal-reshaping construction that leaves GRPO's group-normalized advantage construction unchanged: compile-and-semantic layered rewards reshape trajectory ranking, step-level process scores outside group reward normalization reshape within-trajectory update strength, and failure-cause-aware rollout governance reshapes within-group comparability. Experiments show a clear end-to-end gain: full signal-reshaped GRPO improves strict compile-and-semantic accuracy from the base model's zero-shot $0.385$ to $0.535$. Controlled comparisons further explain the source of this gain: binary rewards remove the compile-only middle tier and degrade trajectory control; on top of layered rewards, process-score weighting further improves accuracy from $0.48$ to $0.53$ and reduces average evaluation steps from $23.50$ to $17.02$. As a boundary comparison, privileged-prompt token-level distillation mainly optimizes local distributional alignment; in long tool-use trajectories, this signal is diluted by non-critical tokens and cannot replace outcome semantics, process credit, or within-group comparability.
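To make the three reshaped signals concrete, the following is a minimal Python sketch rather than the authors' implementation. The tier values (0.0 / 0.5 / 1.0), the multiplicative use of process scores, the `env_failure` cause label, and all function names are assumptions introduced here for illustration; the abstract specifies only that rewards are layered, that process scores sit outside group reward normalization, and that rollout governance is failure-cause-aware.

```python
from statistics import mean, pstdev


def layered_reward(compiles: bool, semantic_ok: bool) -> float:
    """Three-tier outcome reward (tier values are hypothetical)."""
    if compiles and semantic_ok:
        return 1.0  # top tier: compiles and satisfies the semantic check
    if compiles:
        return 0.5  # compile-only middle tier, which a binary reward would drop
    return 0.0      # does not compile


def comparable_rollouts(rollouts: list[dict]) -> list[dict]:
    """Keep rollouts whose failure cause does not break comparability.

    Assumption: a rollout that died for environment reasons carries the
    hypothetical label 'env_failure' and is excluded from the group.
    """
    return [r for r in rollouts if r.get("failure_cause") != "env_failure"]


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO-style advantage: normalize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_weighted_advantages(advantage: float, process_scores: list[float]) -> list[float]:
    """Scale one trajectory's advantage per step.

    The process scores are applied after normalization, so they change the
    update strength inside a trajectory without changing the group ranking.
    """
    return [advantage * s for s in process_scores]


# One prompt, four rollouts (all values are made up for illustration).
rollouts = [
    {"compiles": True, "semantic_ok": True, "process_scores": [1.0, 0.5]},
    {"compiles": True, "semantic_ok": False, "process_scores": [0.5, 1.0]},
    {"compiles": False, "semantic_ok": False, "process_scores": [1.0]},
    {"compiles": False, "semantic_ok": False, "process_scores": [1.0],
     "failure_cause": "env_failure"},
]

group = comparable_rollouts(rollouts)
rewards = [layered_reward(r["compiles"], r["semantic_ok"]) for r in group]
advantages = group_advantages(rewards)
per_step = [
    step_weighted_advantages(a, r["process_scores"])
    for a, r in zip(advantages, group)
]
```

In this sketch, a binary reward would assign the compile-only rollout the same value as the non-compiling one, so the two would receive identical advantages; the middle tier is what separates them in the within-group ranking.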