AI Summary
This study addresses the engineering challenges of machine learning evaluation harnesses, whose operational issues and root causes have not been systematically investigated. Based on an empirical analysis of 57 evaluation harnesses, it derives a five-stage workflow model and classifies 16,560 reported issues by workflow stage and root cause. The analysis shows that 41.4% of issues arise in the Specification stage, and that the three most frequent root causes (unimplemented features, documentation gaps, and missing input validation) together account for 61.7% of classified issues. The resulting classification provides an empirical foundation for treating evaluation engineering as a distinct software engineering concern and for designing more robust evaluation systems.
Abstract
Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.
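To make the term concrete, the four responsibilities named in the abstract's first sentence (data loading, model invocation, metric computation, result reporting) can be sketched as a toy harness. This is only an illustrative sketch, not the study's five-stage model and not code from any of the 57 harnesses. Every name here (`Example`, `validate_examples`, `exact_match`, `run_harness`) is hypothetical, and exact match is an assumed metric chosen for brevity. The up-front validation step is included because missing input validation is one of the reported root causes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    """One evaluation item (hypothetical structure for this sketch)."""
    prompt: str
    expected: str


def validate_examples(examples: Sequence[Example]) -> None:
    """Reject empty or malformed datasets before any model call is made."""
    if not examples:
        raise ValueError("dataset is empty")
    for i, ex in enumerate(examples):
        if not ex.prompt.strip():
            raise ValueError(f"example {i} has an empty prompt")


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def run_harness(model: Callable[[str], str], examples: Sequence[Example]) -> dict:
    """Run the four steps in order and return a small report dictionary."""
    validate_examples(examples)                           # data loading and validation
    predictions = [model(ex.prompt) for ex in examples]   # model invocation
    references = [ex.expected for ex in examples]
    score = exact_match(predictions, references)          # metric computation
    return {"n_examples": len(examples), "exact_match": score}  # result reporting
```

Calling `run_harness` with any callable that maps a prompt string to an output string returns the report. An empty dataset raises `ValueError` before the model is ever invoked, so a bad input fails early instead of producing a misleading score.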