🤖 AI Summary
Generative recommendation (GR) suffers from exposure bias during fine-tuning: both single-step supervised fine-tuning (SFT) and direct preference optimization (DPO) neglect unobserved but potentially positive items.
Method: This work introduces Generative Flow Networks (GFlowNets) to GR for the first time, framing recommendation as a multi-step sequential generation task. We design an adaptive trajectory sampler infused with collaborative filtering knowledge and a composite reward model to explicitly capture the distribution of unobserved positives. Additionally, we propose heuristic weighted sampling and knowledge distillation to enhance generalization.
Contribution/Results: Extensive experiments on two real-world datasets and two state-of-the-art GR backbone models demonstrate significant improvements in Recall@10 and NDCG@10. Our approach effectively mitigates exposure bias while improving model robustness and generalization capability.
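The multi-step framing above can be made concrete with the trajectory-balance objective commonly used to train GFlowNets. The sketch below is a minimal, generic illustration, not the paper's exact formulation: the function name, the scalar-reward interface, and the toy numbers are our assumptions. Each recommended item is generated as a trajectory of token-level actions, and the squared residual drives the sampler to produce items in proportion to their reward rather than only the single observed positive.

```python
import math

def trajectory_balance_loss(log_Z, step_log_probs, reward):
    """Generic GFlowNets trajectory-balance loss (illustrative sketch):
    (log Z + sum_t log P_F(a_t | s_t) - log R(x))^2.
    Token-by-token item-ID decoding is tree-structured, so the
    backward-policy term reduces to zero and is omitted here.
    """
    log_pf = sum(step_log_probs)  # log-probability of the whole trajectory
    return (log_Z + log_pf - math.log(reward)) ** 2

# A trajectory whose sampling probability exactly matches its (normalized)
# reward incurs zero loss: here P(trajectory) = 0.5 * 0.5 = R(x) = 0.25.
steps = [math.log(0.5), math.log(0.5)]  # two decoding steps
loss = trajectory_balance_loss(log_Z=0.0, step_log_probs=steps, reward=0.25)
```

Minimizing this residual over many sampled trajectories, rather than maximizing the likelihood of one observed item, is what gives GFlowNets their diverse-generation property.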
📝 Abstract
Generative recommendation (GR), which typically combines an item tokenizer with a generative Large Language Model (LLM), has demonstrated remarkable success across a wide range of scenarios. Most existing research concentrates on developing powerful item tokenizers or advancing LLM decoding strategies to attain superior performance. However, the critical fine-tuning step in GR frameworks, which is essential for adapting LLMs to recommendation data, remains largely unexplored. Current approaches predominantly rely on either the next-token prediction loss of supervised fine-tuning (SFT) or recommendation-specific direct preference optimization (DPO) strategies. Both methods ignore unobserved samples that may be positive, a problem commonly referred to as exposure bias. To mitigate this problem, this paper treats GR as a multi-step generation task and constructs a GFlowNets-based fine-tuning framework (GFlowGR). The proposed framework integrates collaborative knowledge from traditional recommender systems to create an adaptive trajectory sampler and a comprehensive reward model. Leveraging the diverse-generation property of GFlowNets, together with sampling and heuristic weighting techniques, GFlowGR emerges as a promising approach to mitigating the exposure bias problem. Extensive empirical results on two real-world datasets with two different GR backbones highlight the effectiveness and robustness of GFlowGR.
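To illustrate how a reward model might assign non-zero credit to unobserved items, the sketch below blends an observed-interaction signal with a collaborative-filtering similarity score. All names, the mixing weight `alpha`, and the reward floor are hypothetical: the paper's actual reward model is built from collaborative knowledge, not hand-coded rules like these.

```python
def composite_reward(item, observed_positives, cf_score, alpha=0.5, floor=1e-3):
    """Illustrative composite reward (hypothetical, not the paper's model):
    observed positives get full reward; unobserved items get their
    collaborative-filtering similarity damped by alpha, so plausible but
    unclicked items still receive learning signal. A small floor keeps
    rewards strictly positive, which log-space training requires.
    """
    if item in observed_positives:
        return 1.0
    return max(alpha * cf_score(item), floor)

# Example: a CF model scores item 7 highly even though it was never clicked,
# so the sampler is still rewarded for generating it.
cf_table = {3: 0.9, 7: 0.8, 42: 0.1}
reward = composite_reward(7, observed_positives={3},
                          cf_score=lambda i: cf_table.get(i, 0.0))
```

A reward shaped this way is exactly what lets a GFlowNets-style sampler explore beyond the exposed items instead of collapsing onto the logged positives.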