🤖 AI Summary
The lack of large-scale, publicly available natural language (NL) user-profile test collections hinders research on explainable scientific literature recommendation. To address this, the authors introduce SciNUP, a synthetic scholarly recommendation dataset designed for NL-based user profiling, generated automatically from researchers' real publication histories to yield controllable NL interest descriptions with corresponding ground-truth items. Using this dataset, they compare baselines spanning sparse retrieval, dense retrieval, and LLM-based re-ranking. Results show that these methods achieve comparable overall performance yet retrieve largely different items, indicating complementary behaviors that motivate ensemble strategies, while considerable headroom for improvement remains. SciNUP is publicly released as a benchmark resource for explainable recommendation research.
📝 Abstract
The use of natural language (NL) user profiles in recommender systems offers greater transparency and user control compared to traditional representations. However, there is a scarcity of large-scale, publicly available test collections for evaluating NL profile-based recommendation. To address this gap, we introduce SciNUP, a novel synthetic dataset for scholarly recommendation that leverages authors' publication histories to generate NL profiles and corresponding ground-truth items. We use this dataset to conduct a comparison of baseline methods, ranging from sparse and dense retrieval approaches to state-of-the-art LLM-based rerankers. Our results show that while baseline methods achieve comparable performance, they often retrieve different items, indicating complementary behaviors. At the same time, considerable headroom for improvement remains, highlighting the need for effective NL-based recommendation approaches. The SciNUP dataset thus serves as a valuable resource for fostering future research and development in this area.
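The finding that baselines retrieve different yet comparably relevant items suggests combining their outputs. A minimal sketch of one common fusion heuristic, Reciprocal Rank Fusion (RRF) — this is an illustration, not the paper's method, and the item IDs and rankings below are hypothetical:

```python
# Reciprocal Rank Fusion: each item's fused score is the sum of
# 1 / (k + rank) over every ranked list it appears in. Items ranked
# highly by any single ranker surface near the top of the fused list.

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of item IDs; returns IDs sorted by RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["paperA", "paperB", "paperC"]  # e.g., a BM25 run
dense_ranking = ["paperC", "paperA", "paperD"]   # e.g., a bi-encoder run
fused = rrf_fuse([sparse_ranking, dense_ranking])
# papers ranked well by both lists ("paperA", "paperC") rise to the top
```

The constant k (60 is a conventional default) dampens the influence of very high ranks so that no single ranker dominates the fusion.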