🤖 AI Summary
This work addresses two limitations of existing generative recommendation models: weak supervision on underlying user intent, and the difficulty of directly deploying large language models (LLMs), whose semantic signals can be misaligned with business objectives and whose inference costs are prohibitive at scale. To overcome these issues, we propose S-GRec, a framework that decouples an online lightweight generator from an offline LLM-driven two-stage Personalized Semantic Judge (PSJ). During training, S-GRec introduces interpretable aspect-level semantic supervision and employs an Asymmetric Advantage Policy Optimization (A2PO) mechanism that injects semantic signals only when they align with and enhance business goals. Experiments on public benchmarks and an industrial system demonstrate significant improvements in click-through rate (CTR), along with a statistically significant 1.19% increase in gross merchandise value (GMV), all without requiring real-time LLM inference.
📝 Abstract
Generative recommendation models use sequence generation to produce items end-to-end, but training on behavioral logs often provides weak supervision on underlying user intent. Although Large Language Models (LLMs) offer rich semantic priors that could supply such supervision, their direct adoption in industrial recommendation is hindered by two obstacles: semantic signals can conflict with platform business objectives, and LLM inference is prohibitively expensive at scale. This paper presents S-GRec, a semantic-aware framework that decouples an online lightweight generator from an offline LLM-based semantic judge used for train-time supervision. S-GRec introduces a two-stage Personalized Semantic Judge (PSJ) that produces interpretable aspect-level evidence and learns user-conditional aggregation from pairwise feedback, yielding stable semantic rewards. To prevent semantic supervision from deviating from business goals, Asymmetric Advantage Policy Optimization (A2PO) anchors optimization on business rewards (e.g., eCPM) and injects semantic advantages only when they are consistent with those rewards. Extensive experiments on public benchmarks and a large-scale production system validate both effectiveness and scalability, including statistically significant gains in CTR and a 1.19% lift in GMV in online A/B tests, without requiring real-time LLM inference.
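The asymmetric injection behind A2PO can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual formulation: the function name `combine_advantages`, the gating rule (same-sign agreement), and the weight `LAMBDA` are all assumptions made for clarity.

```python
# Hypothetical sketch of A2PO-style asymmetric advantage injection.
# All names and the exact gating rule are illustrative assumptions,
# not the paper's actual implementation.

LAMBDA = 0.5  # assumed weight on the semantic advantage

def combine_advantages(business_adv: float, semantic_adv: float) -> float:
    """Anchor optimization on the business advantage; add the semantic
    advantage only when it points in the same direction (is consistent)."""
    consistent = business_adv * semantic_adv > 0
    if consistent:
        return business_adv + LAMBDA * semantic_adv
    return business_adv  # inconsistent semantic signal is ignored

# Semantic signal agrees with the business reward -> injected
print(combine_advantages(1.0, 0.5))   # 1.25
# Semantic signal conflicts -> dropped, business advantage kept
print(combine_advantages(1.0, -0.5))  # 1.0
```

The key design point the abstract highlights is the asymmetry: the business reward is always the anchor, while the semantic reward can only reinforce, never override, that objective.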