AgentBench: Evaluating LLMs as Agents

📅 2023-08-07

🏛️ International Conference on Learning Representations

📈 Citations: 385

✨ Influential: 31

🤖 AI Summary

Problem: Existing evaluations of large language models (LLMs) as autonomous agents lack systematic, quantitative assessment of their reasoning and decision-making capabilities in interactive environments. Method: This paper introduces AgentBench, a multidimensional, dynamically evolving benchmark comprising eight heterogeneous interactive environments. It proposes the first multi-environment, multi-task, evolutionary evaluation framework tailored for LLM-based agents, integrating environment simulation, multi-turn interaction tracing, failure root-cause analysis, and alignment-data impact validation. Results: Empirical evaluation across 27 mainstream models reveals that commercial models significantly outperform open-source counterparts and uncovers three fundamental bottlenecks: long-horizon reasoning decay, decision inconsistency, and fragile instruction following. The project open-sources all environments, datasets, and evaluation toolkits, establishing a reproducible, extensible infrastructure for agent intelligence assessment.

📝 Abstract

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and OSS competitors. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Training on code and high quality multi-turn alignment data could improve agent performance. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
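
To make the "multi-turn" evaluation setting concrete, here is a minimal sketch of an agent-environment loop that reports a success rate. It is an illustration only and does not use AgentBench's actual code or API: the toy environment `GuessNumberEnv`, the `Agent` type, and the helpers `run_episode` and `success_rate` are all hypothetical names introduced here, and a scripted binary-search policy stands in for an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface for illustration; not AgentBench's real API.
# An agent maps the dialogue history so far to its next action (a string).
Agent = Callable[[List[str]], str]


@dataclass
class GuessNumberEnv:
    """Toy interactive environment: the agent must find a hidden integer."""
    target: int
    max_turns: int = 8

    def reset(self) -> str:
        return "Guess an integer between 1 and 100."

    def step(self, action: str) -> Tuple[str, bool]:
        """Return (observation, solved). A malformed action wastes the turn."""
        try:
            guess = int(action.strip())
        except ValueError:
            return "Invalid action: reply with a single integer.", False
        if guess == self.target:
            return "Correct.", True
        return ("Too low." if guess < self.target else "Too high."), False


def run_episode(env: GuessNumberEnv, agent: Agent) -> bool:
    """Run one multi-turn episode; True if solved within the turn budget."""
    history = [env.reset()]
    for _ in range(env.max_turns):
        action = agent(history)
        observation, solved = env.step(action)
        history += [action, observation]
        if solved:
            return True
    return False


def success_rate(targets: Sequence[int], agent: Agent, max_turns: int = 8) -> float:
    """Fraction of episodes the agent solves, one episode per target."""
    results = [run_episode(GuessNumberEnv(t, max_turns), agent) for t in targets]
    return sum(results) / len(results)


def binary_search_agent(history: List[str]) -> str:
    """Scripted stand-in for an LLM: narrows the range using past feedback."""
    low, high = 1, 100
    # history is [prompt, action1, obs1, action2, obs2, ...]
    for action, obs in zip(history[1::2], history[2::2]):
        guess = int(action)
        if obs == "Too low.":
            low = guess + 1
        elif obs == "Too high.":
            high = guess - 1
    return str((low + high) // 2)
```

In this toy setup, an agent that ignores feedback or replies with malformed actions exhausts its turn budget, which loosely mirrors the long-term reasoning and instruction-following failures the abstract describes. A real evaluation would replace `binary_search_agent` with a call to the model under test.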

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' reasoning and decision-making as agents

Assessing performance gaps between commercial and open-source LLMs

Identifying failure causes in long-term reasoning and instruction following

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-dimensional benchmark with eight interactive environments

Extensive testing of API-based and open-sourced LLMs

Identifies reasoning and instruction following as key challenges

🔎 Similar Papers

A Survey on Large Language Model based Autonomous Agents