🤖 AI Summary
To address inefficient preference learning in conversational recommendation—caused by insufficient exploration of key terms and rigid dialogue-triggering mechanisms—this paper proposes three novel algorithms: CLiSK, CLiME, and CLiSK-ME. First, it introduces smoothed contextual modeling to enhance exploration, enabling uncertainty-driven adaptive dialogue triggering. Second, it establishes a context-sensitive multi-armed bandit theoretical framework and rigorously derives a near-minimax-optimal regret bound of $O(\sqrt{dT\log T})$, proving its tightness via a matching lower bound of $\Omega(\sqrt{dT})$. Empirical evaluation on both synthetic and real-world datasets demonstrates a reduction of at least 14.6% in cumulative regret, significantly improving interactive efficiency and preference estimation accuracy.
📝 Abstract
Conversational recommender systems proactively query users with relevant "key terms" and leverage the feedback to elicit users' preferences for personalized recommendations. Conversational contextual bandits, a prevalent approach in this domain, aim to optimize preference learning by balancing exploitation and exploration. However, several limitations hinder their effectiveness in real-world scenarios. First, existing algorithms employ key term selection strategies with insufficient exploration, often failing to thoroughly probe users' preferences and resulting in suboptimal preference estimation. Second, current algorithms typically rely on deterministic rules to initiate conversations, causing unnecessary interactions when preferences are well-understood and missed opportunities when preferences are uncertain. To address these limitations, we propose three novel algorithms: CLiSK, CLiME, and CLiSK-ME. CLiSK introduces smoothed key term contexts to enhance exploration in preference learning, CLiME adaptively initiates conversations based on preference uncertainty, and CLiSK-ME integrates both techniques. We theoretically prove that all three algorithms achieve a tighter regret upper bound of $O(\sqrt{dT\log T})$ with respect to the time horizon $T$, improving upon existing methods. Additionally, we provide a matching lower bound of $\Omega(\sqrt{dT})$ for conversational bandits, demonstrating that our algorithms are nearly minimax optimal. Extensive evaluations on both synthetic and real-world datasets show that our approaches reduce cumulative regret by at least 14.6%.
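To make the adaptive-triggering idea concrete, here is a minimal sketch of an uncertainty-driven conversation trigger in a LinUCB-style linear bandit. This is an illustration of the general mechanism the abstract describes (converse about a key term only when preference uncertainty is high), not the paper's CLiME algorithm: the arm/key-term features, the noise level, and the threshold `tau` are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000
theta_star = rng.normal(size=d)            # hidden user preference vector
theta_star /= np.linalg.norm(theta_star)

lam, alpha, tau = 1.0, 1.0, 0.5            # ridge param, UCB scale, trigger threshold (hypothetical)
V = lam * np.eye(d)                        # Gram matrix of observed contexts
b = np.zeros(d)                            # reward-weighted feature sum
conversations = 0

for t in range(T):
    # Sample a fresh set of candidate arms (item feature vectors).
    arms = rng.normal(size=(20, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)

    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                  # ridge estimate of preferences
    # UCB score: estimated reward plus confidence width per arm.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
    x = arms[np.argmax(arms @ theta_hat + alpha * widths)]

    # Adaptive trigger: ask about a key term only when the chosen
    # arm's confidence width (preference uncertainty) is large.
    if np.sqrt(x @ V_inv @ x) > tau:
        k = rng.normal(size=d)
        k /= np.linalg.norm(k)             # hypothetical key-term context
        V += np.outer(k, k)
        b += k * (k @ theta_star + 0.1 * rng.normal())
        conversations += 1

    # Recommend the arm and observe noisy reward feedback.
    r = x @ theta_star + 0.1 * rng.normal()
    V += np.outer(x, x)
    b += x * r

theta_final = np.linalg.inv(V) @ b
```

As the estimate sharpens, confidence widths shrink below `tau` and conversations stop on their own, which is the intuition behind avoiding unnecessary interactions once preferences are well-understood.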