🤖 AI Summary
Large language models (LLMs) trained with reinforcement learning often rely on spurious shortcuts, producing unfaithful reasoning and poor causal robustness. This paper proposes a plug-and-play reward-shaping framework that enforces causal coherence along the "prompt → reasoning → answer" pathway to mitigate reward hacking and spurious reasoning. Its core contributions are: (1) internal influence estimation via Jacobian sensitivity analysis; (2) a counterfactual augmentation mechanism; and (3) a Minkowski power-mean aggregator that jointly optimizes accuracy and causal-coherence signals. The method requires no architectural modifications and is compatible with standard RLHF algorithms such as PPO and GRPO. Evaluated on four benchmarks, it achieves an average accuracy gain of +5.49% (up to +9.58%) and significantly improves robustness to causal inversions and lightweight counterfactual edits, while preserving baseline-level task performance.
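The Minkowski power-mean aggregator named above can be sketched in a few lines; the weight `w`, exponent `p`, and function name here are illustrative assumptions, not the paper's exact parameterization:

```python
def minkowski_reward(accuracy: float, coherence: float,
                     p: float = 1.0, w: float = 0.5) -> float:
    """Fuse task-accuracy and causal-coherence rewards with a weighted power mean.

    p = 1 recovers the weighted arithmetic mean; as p -> -inf the mean
    approaches min(accuracy, coherence), pushing the policy to satisfy
    BOTH signals instead of trading one away for the other.
    """
    assert accuracy > 0.0 and coherence > 0.0, "power mean assumes positive rewards"
    return (w * accuracy**p + (1.0 - w) * coherence**p) ** (1.0 / p)
```

A single exponent thus interpolates between a soft average and a hard conjunction of the two reward signals.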
📝 Abstract
Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable parameter that governs the accuracy-coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Across 4 datasets, CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%).
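The Jacobian-based influence estimate can be illustrated with a finite-difference sketch; the function `f`, the inputs, and the column-norm scoring below are hypothetical stand-ins for the paper's model-internal computation (which would use autograd Jacobians rather than finite differences):

```python
import numpy as np

def jacobian_influence(f, x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Finite-difference stand-in for Jacobian-based influence estimation.

    f maps an input vector (e.g. a prompt/rationale representation) to an
    output vector (e.g. answer logits). The influence of input dimension i
    is the L2 norm of column i of the Jacobian df/dx: how strongly the
    output moves when that part of the input is perturbed.
    """
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        J[:, i] = (f(xp) - y0) / eps  # forward-difference column of df/dx
    return np.linalg.norm(J, axis=0)  # one influence score per input dim
```

On a toy map whose output ignores its second input, the second influence score is (numerically) zero, matching the intuition that a causally inert input should contribute no coherence credit.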