LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing agent evaluation benchmarks often suffer from limited real-world relevance, low task complexity, and difficulty in verifying results. To address these limitations, this work proposes Social Perception-Driven Data Generation (SPDG), a methodology that systematically constructs realistic, complex, and verifiable evaluation tasks from publicly sourced social-media questions and real-world product interactions. Building on SPDG, the authors introduce LiveAgentBench, a comprehensive benchmark comprising 104 real-world scenarios and 374 tasks (125 validation, 249 test) with support for continuous updates. Evaluations of state-of-the-art models, frameworks, and commercial agents reveal critical performance bottlenecks in realistic settings and highlight actionable directions for improvement.

📝 Abstract
As large language models grow more capable, general AI agents have become increasingly prevalent in practical applications. However, existing benchmarks face significant limitations and fail to represent real-world user tasks accurately. To address this gap, we present LiveAgentBench, a comprehensive benchmark with 104 scenarios that reflect real user requirements, constructed from publicly sourced questions on social media and real-world products. Central to our approach is the Social Perception-Driven Data Generation (SPDG) method, a novel process we developed to ensure each question's real-world relevance, task complexity, and result verifiability. We evaluate various models, frameworks, and commercial products using LiveAgentBench, revealing their practical performance and identifying areas for improvement. This release includes 374 tasks: 125 for validation and 249 for testing. The SPDG process enables continuous updates with fresh queries from real-world interactions.
Problem

Research questions and friction points this paper is trying to address.

benchmarking
agentic systems
real-world tasks
large language models
evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LiveAgentBench
Social Perception-Driven Data Generation
agentic systems
real-world benchmarking
AI agent evaluation