🤖 AI Summary
This study addresses the fragmented evaluation landscape in exercise recommendation research, where item-level (ILER) and path-level (PLER) approaches are assessed under incompatible setups, which prevents fair comparison and slows progress. To bridge this gap, we propose UniER, a unified benchmark that evaluates both paradigms within a single framework and introduces Weighted Cognitive Gain (WCG) as a common metric. We systematically evaluate 18 representative methods across nine datasets along four dimensions (effectiveness, generalizability, robustness, and efficiency), and use multi-source synthetic data to uncover PLER's systematic advantages under sparse and noisy conditions while exposing the pedagogical limitations of ILER's fragmented recommendations. Results show that PLER consistently outperforms ILER, especially under extreme sparsity and noise. We also release the benchmark platform and code to foster standardized, reproducible research in exercise recommendation.
📝 Abstract
Personalized exercise recommendation aligns pedagogical resources with each student's knowledge mastery, which is crucial for meeting students' evolving learning needs in modern education. The field is currently driven by two dominant paradigms: Item-Level Exercise Recommendation (ILER) optimizes for immediate single-step state transitions, while Path-Level Exercise Recommendation (PLER) constructs coherent learning paths to maximize cumulative gains. Although the two paradigms share the same ultimate objective, disparate evaluation setups have kept these lines of research isolated, hindering unified benchmarking and fair comparison. To fill this gap, we present a Unified Benchmark for Exercise Recommendation (UniER), a comprehensive evaluation framework that unifies ILER and PLER. Specifically, we introduce Weighted Cognitive Gain (WCG) as a unified metric for measuring algorithmic performance across paradigms. Our benchmark encompasses nine datasets spanning four generation methods, enabling the comparison of 18 representative ILER/PLER methods. Through multi-dimensional analyses covering effectiveness, generalizability, robustness, and efficiency, our results reveal the systematic dominance of PLER and expose the pedagogical failure of ILER's fragmented recommendations under extreme sparsity and noise. Furthermore, we provide an open-source codebase for UniER to foster reproducible research and outline potential directions for future investigation.