🤖 AI Summary
Conversational recommender systems (CRS) face significant challenges in low-resource settings, including scarcity of domain-specific dialogue data, high annotation costs, and privacy constraints. To address these, we propose the first active learning–driven dialogue synthesis framework tailored for black-box large language models (LLMs), requiring neither fine-tuning nor access to internal LLM parameters. Leveraging prompt engineering, our method integrates heterogeneous non-dialogue data—such as item metadata, user reviews, and collaborative signals—to enable the selection of highly informative seeds and the generation of semantically consistent, structured dialogues. Empirically, the approach substantially improves zero-shot LLM-based CRS performance in sparse-data regimes, enhancing both recommendation accuracy and dialogue coherence. Moreover, it enables lightweight supervised models trained on synthesized data to approach the performance of fully supervised baselines. This work constitutes the first demonstration of high-quality CRS construction without any human-annotated dialogue data.
📝 Abstract
Conversational recommender systems (CRS) typically require extensive domain-specific conversational datasets, yet high costs, privacy concerns, and data-collection challenges severely limit their availability. Although Large Language Models (LLMs) demonstrate strong zero-shot recommendation capabilities, practical applications often favor smaller, internally managed recommender models due to scalability, interpretability, and data privacy constraints, especially in sensitive or rapidly evolving domains. However, training these smaller models effectively still demands substantial domain-specific conversational data, which remains challenging to obtain. To address these limitations, we propose an active data augmentation framework that synthesizes conversational training data by leveraging black-box LLMs guided by active learning techniques. Specifically, our method utilizes publicly available non-conversational domain data, including item metadata, user reviews, and collaborative signals, as seed inputs. By employing active learning strategies to select the most informative seed samples, our approach efficiently guides LLMs to generate synthetic, semantically coherent conversational interactions tailored explicitly to the target domain. Extensive experiments validate that conversational data generated by our proposed framework significantly improves the performance of LLM-based CRS models, effectively addressing the challenges of building CRS in no- or low-resource scenarios.
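The pipeline described in the abstract — score non-conversational seed records for informativeness, select a diverse subset, and prompt a black-box LLM to synthesize dialogues — can be sketched as follows. Everything here (the farthest-point diversity heuristic, the toy embeddings, the prompt format, and all function names) is an illustrative assumption, not the paper's actual implementation:

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_seeds(features, k):
    """Greedy farthest-point selection: a simple stand-in for an
    active-learning query strategy, picking k mutually diverse seeds."""
    chosen = [0]  # start from an arbitrary first seed
    while len(chosen) < k:
        # pick the candidate farthest from everything chosen so far
        best = max(
            (i for i in range(len(features)) if i not in chosen),
            key=lambda i: min(distance(features[i], features[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

def build_prompt(seed_records):
    """Format the selected non-conversational seeds into a
    dialogue-synthesis prompt for a black-box LLM."""
    lines = ["Generate a user-system recommendation dialogue grounded in:"]
    for r in seed_records:
        lines.append(f"- {r['title']}: {r['review']}")
    return "\n".join(lines)

# Toy seed pool: item metadata plus review snippets, with 2-D
# "embeddings" standing in for real representation vectors.
seeds = [
    {"title": "Movie A", "review": "gripping thriller"},
    {"title": "Movie B", "review": "gripping thriller sequel"},
    {"title": "Movie C", "review": "lighthearted romantic comedy"},
]
features = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.95]]

idx = select_seeds(features, k=2)
prompt = build_prompt([seeds[i] for i in idx])
```

With these toy embeddings, the diversity criterion keeps Movie A and Movie C while skipping the near-duplicate Movie B, so the resulting prompt covers distinct regions of the item space; the real framework would replace this heuristic with its learned informativeness measure and send the prompt to the LLM.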