🤖 AI Summary
This work addresses credit assignment in generative recommender systems trained via reinforcement learning, where a single exact-match reward on the final item cannot reveal which reasoning step went wrong. The authors propose Step-Aligned Policy Optimization (SAPO), which operates on recommendation formulated as autoregressive generation of semantic identifiers (SIDs). SAPO computes a separate group-relative advantage for each reasoning step (one thinking block paired with one SID token) and applies it only to that step, enabling fine-grained credit assignment. On three real-world datasets, the method stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where exact-match feedback is sparse.
📝 Abstract
Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically using an outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback only reports whether the final item is correct: when a generated SID mismatches, the outcome reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder's own decomposition of the output.
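The contrast between a broadcast outcome advantage and a per-step advantage can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the per-step reward, so the sketch assumes a hypothetical reward of 1.0 when the SID token at a given position matches the target, and it assumes mean/std normalization within the sampled group. All function names are illustrative.

```python
from statistics import mean, pstdev

def step_rewards(generated, target):
    """Hypothetical per-step reward (an assumption): 1.0 where the SID token matches."""
    return [1.0 if g == t else 0.0 for g, t in zip(generated, target)]

def group_relative(values, eps=1e-6):
    """Normalize rewards against the mean and std of their own group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def outcome_advantages(group, target):
    """Baseline: one exact-match advantage per sample, broadcast to every step."""
    rewards = [1.0 if sid == target else 0.0 for sid in group]
    return [[a] * len(target) for a in group_relative(rewards)]

def step_advantages(group, target):
    """Step-aligned: a separate group-relative advantage for each SID position."""
    per_sample = [step_rewards(sid, target) for sid in group]
    # Normalize each position (column) across the group, then transpose back.
    per_step = [group_relative(list(col)) for col in zip(*per_sample)]
    return [list(row) for row in zip(*per_step)]

# Four sampled SIDs for one prompt; only the first is an exact match.
target = (3, 7, 1)
group = [(3, 7, 1), (3, 7, 4), (3, 2, 1), (5, 7, 1)]

outcome = outcome_advantages(group, target)
stepwise = step_advantages(group, target)
```

For the second sample, `(3, 7, 4)`, the outcome version assigns the same negative advantage to all three positions, although the first two tokens are correct. The step-wise version gives positive advantages at the two matched positions and a negative one only at the mismatched third position. In a full training loop, each step's advantage would presumably also weight the tokens of the thinking block paired with that SID token, as the abstract describes.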