🤖 AI Summary
This work addresses the credit assignment challenge in large language model (LLM) reinforcement learning, particularly in recommendation scenarios where sparse sequence-level rewards and the absence of ground-truth labels require inferring user intent from ambiguous natural language. To this end, we introduce the Shapley-Owen value, a cooperative game-theoretic solution concept, into LLM-based RL for the first time. By forming coalitions of semantically coherent units (e.g., attribute phrases or preference statements), our method redistributes sequence-level advantages at a fine granularity to achieve paragraph-level credit assignment, without requiring additional parameterized value models. Integrated with implicit reward shaping, the approach learns directly from task feedback while preserving the optimal policy. Experiments on the Amazon ESCI and H&M Fashion datasets demonstrate significant gains over strong baselines and robust out-of-distribution generalization to retrievers unseen at test time.
📝 Abstract
Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards that create a credit assignment gap, obscuring which tokens drive success. This gap is especially problematic when models must infer latent user intent from under-specified language without ground-truth labels, a reasoning pattern rarely seen during pretraining. We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes. Unlike value-model-based methods that require additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy, learning directly from task feedback without parametric value models. By forming coalitions of semantically coherent units (phrases describing product attributes or sentences capturing preferences), OSPO identifies which parts of a response drive performance. Experiments on the Amazon ESCI and H&M Fashion datasets show consistent gains over baselines, with notable test-time robustness to out-of-distribution retrievers unseen during training.
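The core attribution idea can be illustrated with a minimal sketch: compute exact Shapley values over a handful of response segments against a hypothetical set-valued reward function, then read each segment's share as its credit. This is a plain Shapley computation, not the paper's actual OSPO pipeline; the Owen variant additionally respects a fixed coalition structure over segments, and the segment names and `reward` function here are illustrative assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(num_segments, reward):
    """Exact Shapley value of each segment: its average marginal
    contribution to `reward` over all orderings of the segments.
    (Hypothetical stand-in for OSPO's segment-level attribution.)"""
    n = num_segments
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for subset in combinations(others, r):
                # Weight of coalitions of size r that exclude segment i.
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                coalition = frozenset(subset)
                values[i] += weight * (reward(coalition | {i}) - reward(coalition))
    return values

# Toy additive reward over segment indices; for an additive game the
# Shapley value of each segment recovers its own weight exactly.
weights = {0: 0.5, 1: 0.3, 2: 0.2}
reward = lambda coalition: sum(weights[j] for j in coalition)
phi = shapley_values(3, reward)
# Efficiency property: attributions sum to the full-sequence reward.
assert abs(sum(phi) - reward(frozenset(weights))) < 1e-9
```

Exact enumeration is exponential in the number of segments, which is why coalition structure over semantically coherent units (rather than individual tokens) keeps the computation tractable.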