InnoGym: Benchmarking the Innovation Potential of AI Agents

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI benchmarks predominantly evaluate answer correctness, neglecting the methodological originality and practical efficacy of solutions; as a result, they fail to assess agents' genuine innovation potential. Method: We propose InnoGym, the first systematic benchmark framework for evaluating AI agents' innovation potential across 18 real-world engineering and scientific tasks. It introduces a dual-axis evaluation metric, "performance gain" and "novelty", and ensures assessment reliability via resource-aware filtering, expert validation, and standardized solution curation. Evaluation is conducted within the unified iGym environment to enable reproducible, long-horizon testing. Contribution/Results: Experiments reveal that while current AI agents can generate methodologically novel solutions, their limited robustness prevents these solutions from translating into stable performance improvements, highlighting the need to co-optimize innovation and practical efficacy in AI agent design.

📝 Abstract
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
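To make the dual-axis idea concrete, here is a minimal sketch in Python. The relative-improvement normalization for performance gain and the embedding-based novelty proxy below are assumptions for illustration; the paper's exact metric definitions are not given on this page.

```python
import math

def performance_gain(agent_score: float, best_known: float) -> float:
    """Relative improvement over the best-known solution (assumed normalization)."""
    return (agent_score - best_known) / abs(best_known)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def novelty(agent_method: list[float], prior_methods: list[list[float]]) -> float:
    """One minus the similarity to the closest prior approach; embedding-based
    similarity is an assumed proxy for methodological difference."""
    return 1.0 - max(cosine(agent_method, p) for p in prior_methods)

# Toy usage: an agent that beats the best-known score with a fairly different method.
gain = performance_gain(agent_score=0.82, best_known=0.78)
nov = novelty([0.9, 0.1, 0.4], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(f"performance gain: {gain:.3f}, novelty: {nov:.3f}")
```

Under this framing, a solution counts as genuinely innovative only on both axes: a high-novelty, low-gain result is exactly the creativity-without-robustness failure mode the experiments report.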
Problem

Research questions and friction points this paper is trying to address.

Evaluates AI agents' innovation potential beyond correctness
Measures performance gain and novelty in solution approaches
Highlights gap between creativity and effectiveness in AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark measures performance gain and novelty
Standardized tasks from engineering and scientific domains
Unified iGym execution environment for reproducible, long-horizon evaluations (see the sketch below)
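iGym's actual interface is not documented on this page, so the following is only a hypothetical sketch of what a reproducible, long-horizon evaluation loop could look like. Every name in it (`make_env`, `act`, `step`, `final_score`, the step budget) is an illustrative assumption, not the benchmark's API.

```python
# Hypothetical long-horizon evaluation loop; none of these names come from
# iGym itself. `agent` and `tasks` are supplied by the caller.
def evaluate(agent, tasks, budget_steps=10_000):
    results = {}
    for task in tasks:
        env = task.make_env()            # assumed: a fresh sandbox per task
        obs = env.reset(seed=0)          # fixed seed for reproducibility
        for _ in range(budget_steps):    # resource-aware cap on agent actions
            action = agent.act(obs)
            obs, done = env.step(action)
            if done:
                break
        results[task.name] = env.final_score()
    return results
```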
Jintian Zhang
Zhejiang University
NLP · LLMs
Kewei Xu
Zhejiang University
Jingsheng Zheng
Zhejiang University; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Zhuoyun Yu
Zhejiang University
Yuqi Zhu
Zhejiang University; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Yujie Luo
Zhejiang University; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Lanning Wei
Ant Group; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Shuofei Qiao
Zhejiang University
AI Agent · Large Language Models · Natural Language Processing · Knowledge Graphs
Lun Du
Ant Group; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Da Zheng
Amazon
High-performance computing · Data-intensive computing · Large-scale machine learning · Graph neural networks
Shumin Deng
National University of Singapore
NLP · LLM Planning & Reasoning · LLM Agent · KG · IE
Huajun Chen
Zhejiang University; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Ningyu Zhang
Zhejiang University; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph