SimAug: Enhancing Recommendation with Pretrained Language Models for Dense and Balanced Data Augmentation

📅 2025-05-03

📈 Citations: 0

✨ Influential: 0

career value

196K/year

🤖 AI Summary

To address performance degradation and fairness bias in collaborative filtering caused by sparse user-item interactions and long-tailed item distributions, this paper proposes a lightweight, plug-and-play data augmentation method leveraging pretrained language models (PLMs). Specifically, it employs BERT or LLaMA to extract semantic representations from item textual descriptions and generates high-confidence pseudo-interactions via embedding similarity—enabling both positive sample expansion and semantic-aware negative sampling—without requiring model fine-tuning or architectural modifications. This work is the first to directly exploit PLM-driven text semantic similarity for collaborative filtering data augmentation. Extensive experiments across nine public benchmark datasets demonstrate that the method achieves an average 4.2% improvement in Recall@K and reduces the Gini coefficient by 18.7%, thereby significantly enhancing both recommendation utility and fairness.

Technology Category

Application Category

📝 Abstract

Deep Neural Networks (DNNs) are extensively used in collaborative filtering due to their impressive effectiveness. These systems depend on interaction data to learn user and item embeddings that are crucial for recommendations. However, the data often suffers from sparsity and imbalance issues: limited observations of user-item interactions can result in sub-optimal performance, and a predominance of interactions with popular items may introduce recommendation bias. To address these challenges, we employ Pretrained Language Models (PLMs) to enhance the interaction data with textual information, leading to a denser and more balanced dataset. Specifically, we propose a simple yet effective data augmentation method (SimAug) based on the textual similarity from PLMs, which can be seamlessly integrated to any systems as a lightweight, plug-and-play component in the pre-processing stage. Our experiments across nine datasets consistently demonstrate improvements in both utility and fairness when training with the augmented data generated by SimAug. The code is available at https://github.com/YuyingZhao/SimAug.

Problem

Research questions and friction points this paper is trying to address.

Addressing data sparsity in collaborative filtering systems

Mitigating recommendation bias from popular item dominance

Enhancing interaction data using pretrained language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses PLMs for dense balanced data augmentation

Integrates SimAug as plug-and-play component

Improves utility and fairness via textual similarity

🔎 Similar Papers

No similar papers found.