ORBIT - Open Recommendation Benchmark for Reproducible Research with Hidden Tests

📅 2025-10-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current recommender system research faces two critical bottlenecks: (1) widely used datasets poorly reflect authentic user behavior, and (2) heterogeneous evaluation protocols hinder cross-study comparability. To address these issues, we propose ORBITβ€”the first unified, real-world-oriented benchmark for web recommendation. ORBIT comprises: (1) a novel recommendation task built upon high-quality browsing sequences from ClueWeb, accompanied by a hidden test set designed to emulate operational conditions; (2) a standardized evaluation framework, reproducible data splits, and synthetic data generation techniques to support controlled experimentation; and (3) the first integration of instruction-tuned large language models (LLMs) as baselines, enabling rigorous assessment of generalization capability. Extensive experiments across 12 state-of-the-art models reveal a substantial performance gap between public-benchmark results and real-world effectiveness; conventional methods exhibit limited generalizability in large-scale web recommendation, whereas LLM-augmented approaches demonstrate marked promise.

πŸ“ Abstract
Recommender systems are among the most impactful AI applications, interacting with billions of users every day, guiding them to relevant products, services, or information tailored to their preferences. However, the research and development of recommender systems are hindered by existing datasets that fail to capture realistic user behaviors and inconsistent evaluation settings that lead to ambiguous conclusions. This paper introduces the Open Recommendation Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified benchmark for consistent and realistic evaluation of recommendation models. ORBIT offers a standardized evaluation framework of public datasets with reproducible splits and transparent settings for its public leaderboard. Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco, featuring web browsing sequences from 87 million public, high-quality webpages. ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and privacy-guaranteed browsing data. It aligns with modern recommendation scenarios and is reserved as the hidden test part of our leaderboard to challenge recommendation models' generalization ability. ORBIT measures 12 representative recommendation models on its public benchmark and introduces a prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results reflect general improvements of recommender systems on the public datasets, with variable individual performances. The results on the hidden test reveal the limitations of existing approaches in large-scale webpage recommendation and highlight the potential for improvements with LLM integrations. ORBIT benchmark, leaderboard, and codebase are available at https://www.open-reco-bench.ai.
Problem

Research questions and friction points this paper is trying to address.

Addressing unreliable evaluation methods in recommender systems research
Providing standardized datasets with reproducible splits for consistent benchmarking
Introducing hidden tests to assess model generalization on realistic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized evaluation framework with reproducible data splits
Synthetic dataset from privacy-guaranteed browsing data
Hidden test leaderboard using webpage recommendation task
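To make the "reproducible data splits" idea concrete, here is a minimal sketch of a deterministic leave-one-out split, a common protocol in sequential recommendation. This is a hypothetical illustration, not ORBIT's actual code; the function name and data layout are assumptions.

```python
# Hypothetical sketch of a reproducible split for a recommendation
# dataset (NOT ORBIT's actual implementation). Each user's
# chronologically ordered interaction sequence is split leave-one-out:
# last item -> test, second-to-last -> validation, rest -> train.
# The split is fully deterministic (no randomness), so it is
# reproducible across runs and research groups.

def leave_one_out_split(user_sequences):
    """user_sequences: dict mapping user_id -> list of item_ids,
    assumed already sorted by interaction timestamp."""
    train, valid, test = {}, {}, {}
    for user, items in user_sequences.items():
        if len(items) < 3:
            # Sequence too short to split; keep everything in train.
            train[user] = items
            continue
        train[user] = items[:-2]
        valid[user] = items[-2]
        test[user] = items[-1]
    return train, valid, test

# Example: user u1 has four interactions, user u2 only two.
splits = leave_one_out_split({
    "u1": ["a", "b", "c", "d"],
    "u2": ["x", "y"],
})
```

Because the split depends only on interaction order, any two researchers running it on the same data obtain identical train/validation/test partitions, which is the kind of consistency the benchmark's public leaderboard requires.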
Jingyuan He
Language Technologies Institute, Carnegie Mellon University
Jiongnan Liu
Gaoling School of Artificial Intelligence, Renmin University of China
Information Retrieval
Vishan Vishesh Oberoi
Language Technologies Institute, Carnegie Mellon University
Bolin Wu
Beijing University of Posts and Telecommunications
Information Theory, Coding Theory
Mahima Jagadeesh Patel
Language Technologies Institute, Carnegie Mellon University
Kangrui Mao
Language Technologies Institute, Carnegie Mellon University
Chuning Shi
Language Technologies Institute, Carnegie Mellon University
I-Ta Lee
Meta
Arnold Overwijk
Unknown affiliation
Language Models, Recommendation, Information Retrieval, Natural Language Understanding
Chenyan Xiong
Associate Professor, Carnegie Mellon University
Information Retrieval, Language Models, Natural Language Understanding