🤖 AI Summary
Large language models (LLMs) trained with reinforcement learning often rely on spurious shortcuts, producing unfaithful reasoning and poor causal robustness. This paper proposes a plug-and-play reward-shaping framework that enforces causal coherence along the "prompt → reasoning → answer" pathway to mitigate reward hacking and spurious reasoning. Its core contributions are: (1) internal influence estimation via Jacobian sensitivity analysis; (2) a counterfactual augmentation mechanism; and (3) a Minkowski power-mean aggregator that jointly optimizes accuracy and causal-coherence signals. The method requires no architectural modifications and is compatible with standard RLHF algorithms such as PPO and GRPO. Evaluated on four benchmarks, it achieves an average accuracy gain of +5.49% (up to +9.58%) and significantly improves robustness to causal inversions and lightweight counterfactual edits, while preserving baseline-level task performance.
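The Minkowski power-mean aggregator named above can be sketched in a few lines; the weight `w`, exponent `p`, and function name here are illustrative assumptions, not the paper's exact parameterization:

```python
def minkowski_reward(accuracy: float, coherence: float,
                     p: float = 1.0, w: float = 0.5) -> float:
    """Fuse task-accuracy and causal-coherence rewards with a weighted power mean.

    p = 1 recovers the weighted arithmetic mean; as p -> -inf the mean
    approaches min(accuracy, coherence), pushing the policy to satisfy
    BOTH signals instead of trading one away for the other.
    """
    assert accuracy > 0.0 and coherence > 0.0, "power mean assumes positive rewards"
    return (w * accuracy**p + (1.0 - w) * coherence**p) ** (1.0 / p)
```

A single exponent thus interpolates between a soft average and a hard conjunction of the two reward signals.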
📝 Abstract
Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable parameter that governs the accuracy-coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Across 4 datasets, CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%).
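The Jacobian-based influence estimate can be illustrated with a finite-difference sketch; the function `f`, the inputs, and the column-norm scoring below are hypothetical stand-ins for the paper's model-internal computation (which would use autograd Jacobians rather than finite differences):

```python
import numpy as np

def jacobian_influence(f, x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Finite-difference stand-in for Jacobian-based influence estimation.

    f maps an input vector (e.g. a prompt/rationale representation) to an
    output vector (e.g. answer logits). The influence of input dimension i
    is the L2 norm of column i of the Jacobian df/dx: how strongly the
    output moves when that part of the input is perturbed.
    """
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        J[:, i] = (f(xp) - y0) / eps  # forward-difference column of df/dx
    return np.linalg.norm(J, axis=0)  # one influence score per input dim
```

On a toy map whose output ignores its second input, the second influence score is (numerically) zero, matching the intuition that a causally inert input should contribute no coherence credit.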