🤖 AI Summary
This work addresses limitations of current large language models (LLMs) in recommendation reranking, particularly their underused reasoning capabilities and their reliance on non-semantic item IDs, which hinders industrial scalability. To overcome these issues, the authors propose GR2, an end-to-end generative reasoning reranking framework trained in three stages: semantic ID encoding, supervised fine-tuning on high-quality reasoning trajectories, and reinforcement learning with a conditional verifiable reward mechanism. GR2's key contributions are integrating high-order reasoning with reinforcement learning, improving industrial scalability through semantic IDs, and incorporating an anti-gaming reward design. Experiments on two real-world datasets show that GR2 outperforms the state-of-the-art method OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablation studies further confirm the effectiveness of both the reasoning trajectories and the proposed reward mechanism.
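The second stage above filters teacher-generated reasoning trajectories by rejection sampling before supervised fine-tuning. A minimal sketch of that filter, assuming a binary-relevance NDCG@k quality check and a fixed acceptance threshold (the function names, metric, and threshold are illustrative assumptions, not GR2's exact implementation):

```python
import math


def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance NDCG@k for a reranked list of item IDs."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def rejection_sample(traces, relevant, k=5, threshold=0.8):
    """Keep only (reasoning, reranked_list) pairs whose list quality passes.

    traces:   list of (reasoning_text, reranked_item_ids) proposed by the
              larger teacher LLM (hypothetical data layout).
    relevant: ground-truth positive item IDs for this request.
    """
    return [(reasoning, ranked) for reasoning, ranked in traces
            if ndcg_at_k(ranked, relevant, k) >= threshold]
```

Only the surviving pairs would be used as SFT targets, so the student model is fine-tuned exclusively on reasoning that actually led to a good reranked list.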
📝 Abstract
Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger, larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that reward design is crucial for RL in reranking: LLMs tend to hack the reward by simply preserving the input item order, motivating conditional verifiable rewards that mitigate this behavior and optimize reranking performance.
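The anti-gaming idea in the final sentence can be sketched as a reward function that is only granted conditionally. This is a minimal illustration of a conditional verifiable reward, not GR2's actual reward: the gating rules (valid permutation, no input-order copying) and the NDCG@k payoff are assumptions chosen to match the behavior the abstract describes.

```python
import math


def dcg_at_k(ranked, relevant, k):
    """Binary-relevance DCG@k over a list of item IDs."""
    return sum(1.0 / math.log2(i + 2)
               for i, item in enumerate(ranked[:k]) if item in relevant)


def conditional_reward(candidates, reranked, relevant, k=5):
    """Verifiable reward gated on a genuine reranking (illustrative).

    candidates: candidate item IDs in their original input order.
    reranked:   item IDs emitted by the policy.
    relevant:   ground-truth positive item IDs.
    """
    # Condition 1: the output must be a permutation of the candidate set,
    # so hallucinated or dropped items earn nothing.
    if sorted(reranked) != sorted(candidates):
        return 0.0
    # Condition 2: merely copying the input order earns nothing, which
    # removes the reward-hacking shortcut of preserving item order.
    if reranked == candidates:
        return 0.0
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg_at_k(reranked, relevant, k) / ideal if ideal else 0.0
```

Under a plain NDCG reward, echoing the retrieval order is often a high-reward, zero-effort policy; gating the reward on the output being a true, non-identity permutation forces the model to earn credit through actual reordering.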