🤖 AI Summary
Problem: Large language models (LLMs) have not been systematically evaluated on their ability to follow long, complex instructions in agent-based scenarios, particularly instructions that carry structured constraints such as tool specifications and conditional logic. Method: We introduce AgentIF, the first agent-oriented benchmark for long-instruction following, constructed from 50 real-world industrial and open-source agent applications. Instructions average 1,723 words and 11.9 constraints each. We formally define and evaluate LLMs’ capability to adhere to diverse structured constraints, employing three automated assessment paradigms (code execution, LLM-assisted verification, and hybrid evaluation), complemented by human annotation and deterministic validation. Contribution/Results: Experiments reveal substantial deficiencies in state-of-the-art LLMs in parsing complex constraints and invoking tools precisely. We publicly release the dataset and evaluation framework to advance research on agent-level instruction alignment.
📝 Abstract
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, which introduces a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating the instruction-following ability of LLMs in agentic scenarios. AgentIF has three key characteristics: (1) Realistic: it is constructed from 50 real-world agentic applications. (2) Long: instructions average 1,723 words, with a maximum of 15,630 words. (3) Complex: instructions average 11.9 constraints each and cover diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and the corresponding evaluation metrics, which include code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, and report findings on the failure modes of existing LLMs. We have released the code and data to facilitate future research.
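To make the three evaluation modes named above more concrete, the following is a minimal sketch of how per-constraint checks of each kind might be organized. It is not the released AgentIF code: the `Constraint` class, the three example checks, the `toy_judge` stand-in for a model call, and the reading of "hybrid" as "the judge extracts a span, then code verifies it" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical LLM judge interface: prompt text in, answer text out.
Judge = Callable[[str], str]


@dataclass
class Constraint:
    description: str
    mode: str  # "code", "llm", or "hybrid" (illustrative labels)
    check: Callable[[str, Judge], bool]


def check_word_limit(response: str, judge: Judge) -> bool:
    # Code-based: fully deterministic, no model call.
    return len(response.split()) <= 30


def check_polite(response: str, judge: Judge) -> bool:
    # LLM-based: delegate a semantic judgment to the judge.
    verdict = judge("JUDGE_POLITE\n" + response)
    return verdict.strip().lower().startswith("yes")


def check_title_length(response: str, judge: Judge) -> bool:
    # Hybrid (assumed reading): the judge extracts a span, code verifies it.
    title = judge("EXTRACT_TITLE\n" + response)
    return 0 < len(title.split()) <= 5


def toy_judge(prompt: str) -> str:
    # Toy stand-in for an LLM call; a real judge would query a model.
    task, text = prompt.split("\n", 1)
    if task == "EXTRACT_TITLE":
        return text.splitlines()[0] if text else ""
    return "yes" if "thank" in text.lower() else "no"


CONSTRAINTS: List[Constraint] = [
    Constraint("At most 30 words", "code", check_word_limit),
    Constraint("Polite tone", "llm", check_polite),
    Constraint("Title (first line) has at most 5 words", "hybrid", check_title_length),
]


def evaluate(response: str, constraints: List[Constraint], judge: Judge) -> Dict:
    # Run every check and aggregate at constraint and instruction level.
    results = [c.check(response, judge) for c in constraints]
    return {
        "per_constraint": results,
        "satisfied_fraction": sum(results) / len(results),
        "all_satisfied": all(results),
    }


if __name__ == "__main__":
    reply = "Weekly Sales Report\nThank you for your request. Sales rose this week."
    print(evaluate(reply, CONSTRAINTS, toy_judge))
```

In this sketch each constraint carries its own checker, so deterministic checks never call a model and only semantic judgments pay that cost; the two aggregate fields are illustrative and are not claimed to match the benchmark's reported metrics.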