The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents

📅 2026-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic benchmarks in robot learning, which hinders comprehensive evaluation of models on complex, diverse behaviors. To this end, we propose GM-100, an embodied-intelligence evaluation benchmark comprising 100 meticulously designed tasks and introducing, for the first time, an "Olympics"-style holistic assessment framework. The tasks are grounded in human manipulation primitives and object affordance analysis, span a broad spectrum of human-robot interaction scenarios (including long-tail cases), and are supported by multi-platform trajectory data, enabling systematic evaluation of vision-language-action (VLA) models. Experiments demonstrate that the GM-100 tasks are both executable and challenging, effectively differentiating the performance of state-of-the-art VLA models and thereby establishing a high-quality, unified, and diverse evaluation standard for embodied AI research.

📝 Abstract
Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (GM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at https://rhos.ai/research/gm-100.
Problem

Research questions and friction points this paper is trying to address.

embodied AI, robot learning, task design, evaluation benchmark, imitation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied AI, Task Benchmark, Human-Object Interaction, Affordance, Robot Learning
Ziyu Wang
NYU Shanghai
artificial intelligence, computer music
Chenyuan Liu
SJTU
Yushun Xiang
SJTU
Runhao Zhang
SII
Qingbo Hao
SJTU
Hongliang Lu
SJTU
Houyu Chen
SJTU
Zhizhong Feng
SJTU
Kaiyue Zheng
SJTU
Dehao Ye
SJTU
Xianchao Zeng
SII
Xinyu Zhou
SII
Boran Wen
SJTU, SII
Jiaxin Li
SJTU, SII
Mingyu Zhang
SJTU, SII
Kecheng Zheng
Robbyant
Qian Zhu
Robbyant
Ran Cheng
Robbyant
Yong-Lu Li
Associate Professor, Shanghai Jiao Tong University / Shanghai Innovation Institute
Physical Reasoning, Robotics, Computer Vision, Machine Learning, Embodied AI