Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of insufficient exploration and compressed advantage signals in generative recommendation systems driven by reinforcement learning, which often stem from a mismatch between action probabilities and rewards. To overcome these issues, the authors propose the V-STAR framework, which integrates value-guided decoding (VED), tree-structured trajectory construction, and a novel Sibling-GRPO algorithm. The latter enhances decision-making by estimating relative advantages among sibling nodes, thereby focusing exploration on critical branching decisions, mitigating exploration bias, and improving advantage discriminability. Experimental results demonstrate that V-STAR significantly outperforms existing methods on both offline and online datasets under strict latency constraints, simultaneously achieving higher recommendation accuracy and greater candidate diversity.
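The sibling-relative advantage idea above can be sketched as follows. This is a minimal illustration assuming a GRPO-style group normalization applied over the rewards of sibling branches at one tree node; the function name and the exact normalization are assumptions for illustration, not the paper's formulation:

```python
# Illustrative sketch of sibling-relative advantage estimation
# (Sibling-GRPO-style). Names and normalization are assumed, not
# taken from the paper.
from statistics import mean, pstdev

def sibling_advantages(sibling_rewards, eps=1e-6):
    """Normalize each sibling branch's reward against its sibling group,
    so the comparative signal concentrates on that branching decision
    rather than on shared high-probability prefixes."""
    mu = mean(sibling_rewards)
    sigma = pstdev(sibling_rewards)
    return [(r - mu) / (sigma + eps) for r in sibling_rewards]

# Example: three sibling branches under one decisive node.
adv = sibling_advantages([1.0, 0.0, 0.5])
```

Because the normalization is local to each sibling group, trajectories that share a prefix no longer dilute each other's signal: variance is measured exactly where the branching choice was made.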

📝 Abstract
Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
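As a rough illustration of the value-guided selective deepening described above, the sketch below scores each candidate prefix by blending its log-probability with a learned value estimate, then keeps only the top-k prefixes for deeper expansion. The blending weight `alpha` and the stand-in `logprob_fn`/`value_fn` callables are hypothetical, not the paper's actual scoring rule:

```python
# Illustrative sketch of value-guided selective deepening.
# `alpha`, `logprob_fn`, and `value_fn` are assumed stand-ins.
import heapq

def value_guided_expand(prefixes, logprob_fn, value_fn, k=2, alpha=0.5):
    """Score each candidate prefix by a blend of likelihood and estimated
    value, and keep only the top-k for deeper expansion, so high-reward
    but low-probability branches are not prematurely pruned."""
    scored = [
        (alpha * logprob_fn(p) + (1 - alpha) * value_fn(p), p)
        for p in prefixes
    ]
    return [p for _, p in heapq.nlargest(k, scored, key=lambda t: t[0])]

# Toy example: prefix (2,) has low likelihood but a high value estimate,
# so value guidance rescues it from pruning.
kept = value_guided_expand(
    [(1,), (2,), (3,)],
    logprob_fn=lambda p: {(1,): -0.1, (2,): -2.0, (3,): -1.0}[p],
    value_fn=lambda p: {(1,): 0.0, (2,): 5.0, (3,): 0.5}[p],
    k=2,
)
```

Under pure likelihood scoring, prefix (2,) would be pruned first; adding the value term keeps it in the beam, which is the exploration behavior the abstract attributes to VED.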
Problem

Research questions and friction points this paper is trying to address.

generative recommendation
reinforcement learning
probability-reward mismatch
exploration
advantage compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Value-Guided Sampling
Tree-structured Advantage
Generative Recommendation
Sibling-GRPO
Exploration Efficiency
Jie Jiang
Tencent Inc., China
Yangru Huang
Peking University
Zeyu Wang
Tencent Inc., China
Changping Wang
Tencent Inc., China
Yuling Xiong
Tencent Inc., China
Jun Zhang
Tencent
Huan Yu
Tencent Inc., China