🤖 AI Summary
Conventional offline evaluation of recommender systems rewards predicting the items a user has already watched and rated highly, which misaligns with the true objective: helping users find items they would enjoy watching. Method: This paper adopts "watch intent" as the evaluation target and contributes an extension to the MovieLens-32M dataset. To build it, the authors recruited MovieLens users, collected their profiles, generated recommendations with a diverse set of algorithms, pooled the recommendations, and had the users assess the pools, yielding explicit watch-intent labels for candidate items. Contribution/Results: Under traditional machine-learning-style evaluation, a popularity baseline (recommending by total rating count) ranks mid-pack among the twenty-two runs used to build the pools; under watch-intent assessment it falls among the worst performers, suggesting the new paradigm alleviates the popularity bias induced by information-retrieval-style effectiveness measures. The dataset is publicly released, providing a more realistic, behaviorally grounded evaluation benchmark for recommender systems.
📝 Abstract
Offline evaluation of recommender systems has traditionally been framed as a machine learning problem. In the classic case of recommending movies, where the user has provided explicit ratings of which movies they like and don't like, each user's ratings are split into test and train sets, and the evaluation task becomes to predict the held-out test data using the training data. This machine learning style of evaluation makes the objective to recommend the movies that a user has watched and rated highly, which is not the same task as helping the user find movies that they would enjoy if they watched them. This mismatch in objective between evaluation and task is a compromise to avoid the cost of asking a user to evaluate recommendations by watching each movie. As a resource available for download, we offer an extension to the MovieLens-32M dataset that provides for new evaluation objectives. Our primary objective is to predict the movies that a user would be interested in watching, i.e., to predict their watchlist. To construct this extension, we recruited MovieLens users, collected their profiles, made recommendations with a diverse set of algorithms, pooled the recommendations, and had the users assess the pools. Notably, we found that the traditional machine learning style of evaluation ranks the Popular algorithm, which recommends movies based on total number of ratings in the system, in the middle of the twenty-two recommendation runs we used to build the pools. In contrast, when we rank the runs by users' interest in watching movies, recommending popular movies becomes one of the worst performing runs. It appears that by asking users to assess their personal recommendations, we can alleviate the popularity bias issues created by using information retrieval effectiveness measures for the evaluation of recommender systems.
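The two mechanics the abstract relies on, a Popular baseline that ranks movies by total rating count and the pooling of top results from several runs into one assessment set, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function names, data shapes, and pool depth are assumptions made for the example.

```python
from collections import Counter

def popular_run(ratings, k=10):
    """The 'Popular' baseline: rank movies by total number of ratings.

    `ratings` is an iterable of (user_id, movie_id, rating) tuples;
    the rating values themselves are ignored, only counts matter.
    """
    counts = Counter(movie for _, movie, _ in ratings)
    return [movie for movie, _ in counts.most_common(k)]

def pool_recommendations(runs, depth=10):
    """Merge the top-`depth` items from each run into one deduplicated
    assessment pool, which users are then asked to judge."""
    pool, seen = [], set()
    for run in runs:
        for movie in run[:depth]:
            if movie not in seen:
                seen.add(movie)
                pool.append(movie)
    return pool
```

Because the pool mixes candidates from all runs before users assess them, a movie's chance of being judged does not depend on which single algorithm surfaced it; the watch-intent labels then apply equally to every run.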