AcademiClaw: When Students Set Challenges for AI Agents

📅 2026-05-04

🤖 AI Summary

This study targets two gaps: current AI agents perform poorly on complex, long-horizon tasks in real academic settings, and existing OpenClaw benchmarks cover only assistant-level tasks. The authors introduce AcademiClaw, a bilingual benchmark of 80 tasks drawn from university students' actual academic work, selected from 230 student submissions through expert review and spanning more than 25 disciplines. Each task runs in an isolated Docker sandbox, and 16 tasks require CUDA GPU execution. Task completion is scored with multi-dimensional rubrics that combine six complementary techniques, and a separate five-category safety audit analyzes agent behavior. Data and code are open source. Across six frontier models, the best pass rate is only 55%, and the analysis shows sharp capability differences across domains, divergent strategies among models, and a disconnect between token consumption and output quality.

📝 Abstract

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows (homework, research projects, competitions, and personal projects) that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.
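
To make the scoring description concrete, here is a minimal sketch of how per-task rubric scores might be combined and turned into a benchmark-level pass rate. The abstract does not specify the rubric format, the weighting scheme, or the pass threshold, so the `RubricItem` structure, the weighted-mean aggregation, and the 0.6 threshold are all illustrative assumptions and not AcademiClaw's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One scoring dimension for a task (hypothetical structure)."""
    name: str
    weight: float  # relative importance; assumed, not taken from the paper
    score: float   # value in [0, 1] produced by some scoring technique


def task_score(items: list[RubricItem]) -> float:
    """Weighted mean of the rubric scores for a single task."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(item.weight * item.score for item in items) / total_weight


def pass_rate(scores: list[float], threshold: float = 0.6) -> float:
    """Fraction of tasks whose score reaches the assumed pass threshold."""
    if not scores:
        raise ValueError("no task scores given")
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Made-up rubric for one task, for illustration only.
    rubric = [
        RubricItem("correctness", weight=2.0, score=1.0),
        RubricItem("report quality", weight=1.0, score=0.25),
    ]
    print(task_score(rubric))

    # Made-up scores for an 80-task run: 44 tasks pass, 36 fail.
    print(pass_rate([1.0] * 44 + [0.0] * 36))
```

Under these assumptions, 44 passing tasks out of 80 would correspond to a 55% pass rate. Whether the reported figure arises from exactly that count is not stated in the abstract.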

Problem

Research questions and friction points this paper is trying to address.

academic-level tasks

AI agent evaluation

complex long-horizon tasks

real-world academic workflows

capability assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

AcademiClaw

academic-level AI benchmark

long-horizon tasks