ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing agent evaluation benchmarks suffer from high environment-interaction costs and imbalanced distributions of task horizon and difficulty, leading to unreliable assessments. This work proposes a lightweight evaluation framework centered on a unified grid-based planning task, where task horizon is controlled by the number of hidden slots and difficulty is orthogonally modulated via a decoy budget. Tool invocation is resolved through static JSON files, enabling execution with zero runtime overhead. This design supports efficient, reproducible evaluation suitable for in-training validation. Experiments across 13 models of varying scales and architectures demonstrate that the benchmark exhibits strong domain consistency and discriminative power, enabling controllable and interpretable assessment of agent reasoning capabilities.
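The two control axes described above can be illustrated with a small sketch. The generator below is a hypothetical reconstruction, not the paper's actual code: slot vocabulary, constraint details, and how decoys are sampled are all assumptions; only the roles of `H` (hidden slots, horizon) and `B` (decoy budget, difficulty) come from the source.

```python
import random

def make_task(n_slots=10, H=4, B=6, vocab=("A", "B", "C", "D", "E"), seed=0):
    """Hypothetical task generator: a schedule of n_slots values, with H
    slots hidden (the agent must fill them) and B decoy candidates that
    are meant to be globally misleading. Details are illustrative only."""
    rng = random.Random(seed)
    schedule = [rng.choice(vocab) for _ in range(n_slots)]
    hidden = sorted(rng.sample(range(n_slots), H))   # H controls horizon
    decoys = [rng.choice(vocab) for _ in range(B)]   # B controls difficulty
    visible = [None if i in hidden else v for i, v in enumerate(schedule)]
    return {
        "visible": visible,                          # partially completed schedule
        "hidden": hidden,                            # indices the agent must fill
        "answer": [schedule[i] for i in hidden],     # ground-truth fills
        "decoys": decoys,                            # misleading candidates
    }

# Scaling H lengthens the horizon; scaling B raises difficulty independently.
task = make_task(H=4, B=6)
```

Because `H` and `B` are independent arguments, a sweep over either axis holds the other fixed, which is what makes the resulting difficulty/horizon grid interpretable.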
📝 Abstract
Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench, built around a unified grid-based planning task in which agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots $H$, and Controllable Difficulty, governed by a decoy budget $B$ that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that $H$ and $B$ provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.
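The static tool resolution idea can be sketched as a lookup table: each tool call is answered from a precomputed JSON file rather than a live runtime, so evaluation incurs no environment setup or execution cost. The class and key format below are assumptions for illustration; the paper does not specify its file schema.

```python
import json

class StaticToolEnv:
    """Hypothetical sketch of static tool resolution: tool calls are
    answered from a precomputed mapping (loadable from a JSON file)
    instead of being executed in a live environment."""

    def __init__(self, responses):
        # responses: dict mapping canonical call keys to tool outputs
        self.responses = responses

    @classmethod
    def from_json(cls, path):
        # Precomputed responses are stored as a static JSON file on disk.
        with open(path) as f:
            return cls(json.load(f))

    @staticmethod
    def _key(tool, args):
        # Canonicalize the call so identical calls hit the same entry.
        return f"{tool}|{json.dumps(args, sort_keys=True)}"

    def call(self, tool, **args):
        key = self._key(tool, args)
        if key not in self.responses:
            raise KeyError(f"no precomputed response for call: {key}")
        return self.responses[key]

# Usage: responses are generated once offline; every evaluation run
# afterwards is a pure dictionary lookup (zero runtime overhead).
env = StaticToolEnv({'check_slot|{"slot": 3}': {"allowed": ["A", "C"]}})
result = env.call("check_slot", slot=3)
```

A side benefit of this design is determinism: since every call resolves to the same stored response, runs are exactly reproducible, which matters for the training-time validation use case the abstract mentions.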
Problem

Research questions and friction points this paper is trying to address.

Agent Benchmarking
Evaluation Overhead
Task Horizon
Difficulty Control
Lightweight Environment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Environment
Scalable Horizons
Controllable Difficulty
Agent Benchmarking
Static Tool Resolution