Can LLM-Driven Hard Negative Sampling Empower Collaborative Filtering? Findings and Potentials

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses key limitations of conventional negative sampling in collaborative filtering (CF)—namely, noise injection, insufficient semantic discriminability, and weak behavioral constraints—by proposing a semantic negative sampling paradigm. Methodologically, it introduces HNLMRec, a behaviorally supervised large language model (LLM) fine-tuning framework that jointly integrates user profiling, semantic hard negative sampling, and embedding alignment. Its core contributions are threefold: (1) the first systematic formalization of *semantic hard negatives*; (2) the design and implementation of HNLMRec; and (3) empirical validation across multiple public benchmarks, where it consistently outperforms state-of-the-art baselines—achieving an average 12.6% improvement in Recall@20 and effectively mitigating data sparsity, popularity bias, and false hard negative sampling issues. The source code is publicly available.

📝 Abstract
Hard negative samples can accelerate model convergence and optimize decision boundaries, which is key to improving the performance of recommender systems. Although large language models (LLMs) possess strong semantic understanding and generation capabilities, systematic research has not yet been conducted on how to generate hard negative samples effectively. To fill this gap, this paper introduces the concept of Semantic Negative Sampling and explores how to optimize LLMs for high-quality hard negative sampling. Specifically, we design an experimental pipeline that includes three main modules: profile generation, semantic negative sampling, and semantic alignment, to verify the potential of LLM-driven hard negative sampling in enhancing the accuracy of collaborative filtering (CF). Experimental results indicate that hard negative samples generated by LLMs, when semantically aligned and integrated into CF, can significantly improve CF performance, although there is still a certain gap compared to traditional negative sampling methods. Further analysis reveals that this gap primarily arises from two major challenges: noisy samples and a lack of behavioral constraints. To address these challenges, we propose a framework called HNLMRec, based on fine-tuning LLMs supervised by collaborative signals. Experimental results show that this framework outperforms traditional negative sampling and other LLM-driven recommendation methods across multiple datasets, providing new solutions for empowering traditional recommender systems with LLMs. Additionally, we validate the strong generalization ability of the LLM-based semantic negative sampling method on new datasets, demonstrating its potential in alleviating issues such as data sparsity, popularity bias, and false hard negative samples. Our implementation code is available at https://github.com/user683/HNLMRec.
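To make the role of hard negatives concrete, here is a minimal sketch of a BPR-style CF training step in which the negative item is drawn from an LLM-suggested hard-negative pool rather than sampled uniformly at random. This is an illustration of the general idea only, not HNLMRec's actual implementation; all names (`hard_neg_pool`, `bpr_step`, the tiny embedding tables) are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in embedding tables (illustrative only, not HNLMRec's API).
n_users, n_items, dim = 4, 10, 8
U = rng.normal(scale=0.1, size=(n_users, dim))  # user embeddings
V = rng.normal(scale=0.1, size=(n_items, dim))  # item embeddings

def bpr_loss(u, p, n):
    """-log sigmoid(u.p - u.n): the standard BPR pairwise ranking loss."""
    x = u @ p - u @ n
    return float(np.log1p(np.exp(-x)))

def bpr_step(user, pos, neg, lr=0.1):
    """One SGD step of BPR; `neg` comes from an LLM-suggested hard-negative
    pool instead of uniform random sampling."""
    u, p, n = U[user].copy(), V[pos].copy(), V[neg].copy()
    g = 1.0 / (1.0 + np.exp(u @ p - u @ n))  # gradient scale of -log sigmoid
    U[user] += lr * g * (p - n)
    V[pos] += lr * g * u
    V[neg] -= lr * g * u

# Hypothetical pool: items an LLM judged semantically close to the positive
# but not actually interacted with by the user.
hard_neg_pool = {0: [3, 7]}

before = bpr_loss(U[0], V[1], V[3])
for _ in range(50):
    neg = rng.choice(hard_neg_pool[0])
    bpr_step(user=0, pos=1, neg=neg)
after = bpr_loss(U[0], V[1], V[3])
```

Because hard negatives sit near the decision boundary, each step yields a larger, more informative gradient than an easy random negative would, which is the convergence benefit the abstract refers to.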
Problem

Research questions and friction points this paper is trying to address.

How to generate effective hard negative samples using LLMs
Improving collaborative filtering accuracy with LLM-driven negative sampling
Addressing noisy samples and behavioral constraints in LLM-based recommendations
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven hard negative sampling for CF
HNLMRec framework with fine-tuned LLMs
Semantic alignment enhances recommendation accuracy
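The semantic-alignment idea above can be sketched as fitting a simple adapter that maps LLM text embeddings into the CF model's item-embedding space, so that LLM-chosen negatives can be scored with CF dot products. The linear least-squares adapter below is an illustrative assumption, not the paper's actual alignment module, and all embeddings here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
d_llm, d_cf, n_items = 32, 8, 200

# Stand-ins: text embeddings from an LLM encoder and item embeddings from a
# trained CF model (random here; real values would come from the two models).
llm_emb = rng.normal(size=(n_items, d_llm))
true_proj = rng.normal(size=(d_llm, d_cf))
cf_emb = llm_emb @ true_proj + 0.01 * rng.normal(size=(n_items, d_cf))

# Semantic alignment as a linear adapter fit by least squares:
# find W minimizing ||llm_emb @ W - cf_emb||_F^2.
W, *_ = np.linalg.lstsq(llm_emb, cf_emb, rcond=None)
aligned = llm_emb @ W  # LLM-derived item vectors now live in the CF space

# Relative reconstruction error of the aligned embeddings.
err = np.linalg.norm(aligned - cf_emb) / np.linalg.norm(cf_emb)
```

In practice a nonlinear adapter (e.g. a small MLP) trained jointly with collaborative-signal supervision would replace this closed-form fit, but the goal is the same: put semantic and behavioral representations in one comparable space.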