🤖 AI Summary
This study examines the mismatch between public leaderboard metrics and hidden-set performance in an industrial multi-agent orchestration competition. Drawing on data from the CODS 2025 AssetOpsBench Challenge, including leaderboards, submission logs, registration records, and source code, it reports a weak negative correlation (r = –0.13) between public and private test set scores on execution tasks, compared with a moderate positive correlation (r = 0.69) on planning tasks. The analysis identifies two measurement problems: public planning scores saturate at 72.73%, and the composite score mixes a 0–1 term with 0–100 percentage scores, which leaves that term nearly without effect on the ranking. Top-performing execution solutions rely mostly on guardrails such as response selection and context control rather than on novel agent architectures. To address these issues, the work proposes scale-aware composite scores, skill-level diagnostics, and versioned artifact release.
📝 Abstract
Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 \assetopslive{} challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on \assetops{}. We combine final rank sheets, a 300-submission server log, registrations from 149 teams, best-submission exports, the organizers' winners report, the companion \assetopslive{} system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73\%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning ($r{=}0.69$) but weakly and negatively in execution ($r{=}{-}0.13$), with several systems that scored 45.45\% on public execution reaching 63.64\% on the hidden set. Third, the \tmatch{} term is numerically almost inert in the official composite: because it enters on a 0--1 scale alongside 0--100 percentage scores, it contributes at most 0.05 points per track, and rescaling it would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3\% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails -- response selection, contamination cleanup, fallback, and context control -- rather than introduce novel agent architectures. These findings identify which behaviors the evaluation rewarded, and motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.
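The scale mismatch behind the third result can be illustrated with a small sketch. It is not the official scoring code: the weights, the additive form of the composite, the team names, and all scores below are hypothetical, chosen only so that the 0--1 term contributes at most 0.05 points as the abstract states. It shows how a term on a 0--1 scale is swamped by 0--100 percentage scores, and how putting it on the same scale can reverse a ranking.

```python
# Hypothetical weighted composite; only the 0.05-point ceiling on the
# 0-1 term is taken from the abstract. Everything else is assumed.
W_SCORE = 0.95   # assumed weight on the 0-100 percentage score
W_TMATCH = 0.05  # assumed weight on the 0-1 matching term


def composite(score_pct, tmatch, rescale=False):
    """Combine a 0-100 score with a 0-1 term, optionally rescaled to 0-100."""
    tm = tmatch * 100.0 if rescale else tmatch
    return W_SCORE * score_pct + W_TMATCH * tm


def ranking(teams, rescale=False):
    """Return team names sorted from best to worst composite."""
    return sorted(
        teams,
        key=lambda name: composite(*teams[name], rescale=rescale),
        reverse=True,
    )


# Made-up teams: (percentage score, matching term in [0, 1]).
teams = {
    "team_a": (63.64, 0.20),
    "team_b": (62.00, 0.90),
}

# Largest possible contribution of the unscaled 0-1 term.
max_unscaled_contribution = W_TMATCH * 1.0

raw_order = ranking(teams, rescale=False)
rescaled_order = ranking(teams, rescale=True)

print("max contribution of 0-1 term:", max_unscaled_contribution)
print("official-style order:", raw_order)
print("scale-aware order:  ", rescaled_order)
```

With these made-up numbers, the unscaled term cannot overcome a 1.64-point gap in the percentage score, whereas the rescaled term can, so the two orderings differ. Whether the same holds for real submissions depends on the actual composite definition, which this sketch does not reproduce.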