Survey on Evaluation of LLM-based Agents

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the lack of a systematic framework for evaluating LLM-based agents. We propose the first four-dimensional taxonomy—encompassing foundational capabilities, domain-specific applications, general-purpose benchmarks, and evaluation frameworks—derived from a systematic literature review and multidimensional modeling of over one hundred empirical evaluation practices. Our analysis reveals an emerging trend toward realism and dynamism in agent evaluation, while identifying critical gaps in cost-efficiency, safety, robustness, and fine-grained scalable assessment. The main contributions are: (1) the first comprehensive, multi-dimensional taxonomy for LLM agent evaluation; (2) a holistic evaluation landscape map that clarifies current limitations; and (3) six concrete, actionable research directions to advance standardized, trustworthy agent evaluation. This work provides theoretical foundations for rigorous, reproducible, and application-aware assessment methodologies in the evolving field of LLM agents.

📝 Abstract
The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals emerging trends in the field, identifies current limitations, and proposes directions for future research.
Problem

Research questions and friction points this paper is trying to address.

No systematic framework exists for evaluating LLM-based agents.
Benchmarks for planning, tool use, and memory are fragmented and lack unified analysis.
Cost-efficiency, safety, and robustness remain under-assessed in current evaluations.
Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive survey of LLM-based agent evaluation
Four-dimensional taxonomy: capabilities, applications, generalist benchmarks, and frameworks
Identifies emerging trends, critical gaps, and six future research directions