🤖 AI Summary
This work addresses the lack of effective evaluation for the steerability of natural language recommendation systems under fine-grained and diverse user instructions. We propose SteerEval, a novel evaluation framework that establishes the first benchmark encompassing multi-dimensional interventions (including content type, safety constraints, and stylistic preferences) to systematically assess how recommendation models grounded in natural language user profiles respond to complex editing instructions. Experimental results demonstrate that current models perform adequately on common attributes but exhibit significant limitations when handling niche or composite user intents. Beyond exposing critical bottlenecks in existing approaches, this study provides empirical insights and practical guidelines for the future design of controllable and instruction-following recommender systems.
📝 Abstract
Natural-language user profiles have recently attracted attention not only for improved interpretability, but also for their potential to make recommender systems more steerable. By enabling direct editing, natural-language profiles allow users to explicitly articulate preferences that may be difficult to infer from past behavior. However, it remains unclear whether current natural-language-based recommendation methods can follow such steering commands. While existing steerability evaluations have shown some success for well-recognized item attributes (e.g., movie genres), we argue that these benchmarks fail to capture the richer forms of user control that motivate steerable recommendations. To address this gap, we introduce SteerEval, an evaluation framework designed to measure more nuanced and diverse forms of steerability by using interventions that range from genres to content warnings for movies. We assess the steerability of a family of pretrained natural-language recommenders, examine the potential and limitations of steering on relatively niche topics, and compare how different profile and recommendation interventions impact steering effectiveness. Finally, we offer practical design suggestions informed by our findings and discuss future steps in steerable recommender design.