AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

📅 2026-01-28
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses a limitation of existing AI agent evaluations, which predominantly focus on task difficulty while overlooking the diverse needs of everyday users. We propose the first task-level instruction-following benchmark tailored to common scenarios in learning, work, and daily life, emphasizing natural-language instruction understanding, attachment handling, and deliverable generation. The benchmark covers three user-centered task types: open-ended workflow execution, implicit intent inference, and iterative refinement. We introduce instance-level scoring criteria and a human–AI alignment evaluation protocol, achieving an 80.1% agreement rate between the LLM judge (Gemini-3-Pro) and human raters. The resulting benchmark comprises 104 tasks and 767 scoring points. Evaluations reveal that API-based agent products perform on par with the reinforcement-learning-trained ChatGPT agent, indicating that leading large models now possess practical agentic capabilities.
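The instance-level scoring criteria mentioned above assign each task its own checklist of scoring points rather than applying one global rubric. As a minimal sketch of what such a rubric might look like in code (the field names, weighting scheme, and category labels here are assumptions for illustration, not the paper's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class ScoringPoint:
    """One binary check inside a task's rubric (hypothetical schema)."""
    description: str       # e.g. "the deliverable is a single .xlsx file"
    weight: float = 1.0    # relative importance within this task

@dataclass
class TaskRubric:
    """Instance-level rubric: criteria written for one specific task."""
    task_id: str
    category: str          # e.g. "open workflow" | "latent instruction" | "iterative refinement"
    points: list[ScoringPoint] = field(default_factory=list)

    def score(self, verdicts: list[bool]) -> float:
        """Weighted fraction of scoring points the agent's deliverable satisfies."""
        total = sum(p.weight for p in self.points)
        earned = sum(p.weight for p, ok in zip(self.points, verdicts) if ok)
        return earned / total if total else 0.0
```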

📝 Abstract
The capacity of AI agents to handle tasks of increasing duration and complexity continues to grow, with exceptional performance in coding, deep research, and complex problem-solving evaluations. In daily scenarios, however, general users still perceive little of these advanced capabilities. We argue that current evaluations prioritize raising task difficulty without sufficiently addressing the diversity of agentic tasks needed to cover the daily work, life, and learning activities of a broad demographic. To address this, we propose AgentIF-OneDay, which asks whether general users can complete a diverse array of daily tasks using natural language instructions and AI agents. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or extending ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate with Gemini-3-Pro as the judge. AgentIF-OneDay comprises 104 tasks with 767 scoring points. We benchmarked four leading general AI agents and found that API-based agent products and the agent-RL-trained ChatGPT agent both occupy the first tier. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to build cutting-edge agent products.
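The 80.1% figure reported above is a human–AI agreement rate over per-point verdicts. A minimal sketch of how simple percent agreement between the LLM judge and human raters could be computed, assuming one pass/fail verdict per scoring point (the paper's exact alignment protocol may differ):

```python
def agreement_rate(llm_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Percent agreement: fraction of scoring points on which the LLM
    judge and the human rater return the same pass/fail verdict."""
    if len(llm_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be aligned one-to-one")
    matches = sum(a == b for a, b in zip(llm_verdicts, human_verdicts))
    return matches / len(llm_verdicts)

# Toy check: at 80.1% agreement over 767 scoring points,
# roughly 614 of the judge's verdicts match the human labels.
```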
Problem

Research questions and friction points this paper is trying to address.

AI agents
instruction-following
daily scenarios
task diversity
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-following benchmark
daily scenarios
latent instruction
iterative refinement
agent evaluation
👥 Authors

Kaiyuan Chen · Bytedance · LLM, Scaling Law, AI4Weather, Video Generation
Qimin Wu
Taiyu Hou
Tianhao Tang
Xueyu Hu
Yuchen Hou · PhD student in Computer Science, University of California, Santa Barbara · Machine Learning, Computational Neuroscience, Data Science
Bikun Li
Chengming Qian
Guoyin Wang
Haolin Chen
Haotong Tian
Haoye Zhang
Haoyu Bian
Hongbing Pan
Hongkang Zhang
Hongyi Zhou · Karlsruhe Institute of Technology · reinforcement learning, imitation learning, robotics
Jiaqi Cai
Jiewu Rao
Jiyuan Ren
Ke-Jing Huang
Lucia Zhu Huang
Mingyu Yuan
Naixu Guo
Qicheng Tang
Qinyan Zhang
Shuai Chen
Siheng Chen · Shanghai Jiao Tong University · Collective intelligence, LLM agent, graph signal processing, collaborative perception
Ting Ting Li
Xiaoxing Guo
Yaocheng Zuo
Yaoqi Guo · Peking University · Software Engineering
Yinan Wang
Yinzhou Yu
Yize Wang
Yuanliang Jiang
Yuanyuan Tian · Microsoft Gray Systems Lab (GSL) · Big Data, SQL-on-Hadoop, HTAP, Graph Analytics, Databases
Yuanshuo Zhang
Yuxuan Liu
Yvette Yan Zeng
Zenyu Shan
Zihan Yin · UW Madison · SRAM, VLSI
Xiaobo Hu
Yang Liu
Yixing Ren
Yuan Gong