CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

📅 2025-11-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluations of LLM agents emphasize task completion rates and largely neglect cost-optimal planning and adaptability to change. Method: The paper introduces CostBench, a multi-turn, cost-centric benchmark set in the travel-planning domain. Tasks can be solved through multiple sequences of atomic and composite tools with configurable costs, and four types of dynamic blocking events (such as tool failures and cost changes) test whether agents can replan in real time. The benchmark supports quantitative comparison of open-source and proprietary models. Contribution/Results: Even GPT-5 achieves an exact match rate below 75% on the hardest static tasks, and performance drops by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for agents that are both economically rational and robust.

📝 Abstract

Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents' ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents' economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and necessitate agents to adapt in real time. Evaluating leading open-sourced and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
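To make "tasks solvable via multiple sequences of atomic and composite tools" concrete, here is a minimal Python sketch of cost-optimal planning as a shortest-path search. It is not CostBench's actual API: the tool names, the `(input state, output state, cost)` schema, the cost values, and the `cheapest_plan` function are all assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# Hypothetical tool schema (an assumption, not CostBench's format):
# tool name -> (input state, output state, cost)
Tool = Tuple[str, str, float]

TOOLS: Dict[str, Tool] = {
    "search_flights": ("query", "flights", 3.0),       # atomic
    "filter_flights": ("flights", "shortlist", 2.0),   # atomic
    "book_flight": ("shortlist", "booked", 4.0),       # atomic
    "search_and_filter": ("query", "shortlist", 6.0),  # composite: two steps in one call
}


def cheapest_plan(
    tools: Dict[str, Tool], start: str, goal: str
) -> Optional[Tuple[float, List[str]]]:
    """Dijkstra search over states; returns (total cost, tool sequence) or None."""
    frontier: List[Tuple[float, str, List[str]]] = [(0.0, start, [])]
    best = {start: 0.0}
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry; a cheaper route to this state exists
        for name, (src, dst, tool_cost) in tools.items():
            if src != state:
                continue
            new_cost = cost + tool_cost
            if new_cost < best.get(dst, float("inf")):
                best[dst] = new_cost
                heapq.heappush(frontier, (new_cost, dst, plan + [name]))
    return None


result = cheapest_plan(TOOLS, "query", "booked")
print(result)
```

With these illustrative costs, the three atomic calls total 9.0 while the route through the composite tool totals 10.0, so the atomic sequence is the cost-optimal plan. Changing a single cost can flip that ordering, which is why a benchmark with customizable costs can probe whether an agent reasons about cost at all.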

Problem

Research questions and friction points this paper is trying to address.

Evaluating cost-optimal planning for LLM agents in dynamic environments

Assessing economic reasoning and replanning abilities during tool-use tasks

Measuring agent adaptability to real-world unpredictability like tool failures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CostBench benchmark for cost-optimal planning

Simulates dynamic blocking events requiring real-time adaptation

Evaluates agents' economic reasoning with customizable tool costs
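The interaction between blocking events and replanning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event dictionaries, the tool table, and the `plan` and `apply_event` helpers are hypothetical, and only the two event types named in the abstract (tool failures and cost changes) are modeled, not all four.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical schema: tool name -> (input state, output state, cost)
Tool = Tuple[str, str, float]


def plan(
    tools: Dict[str, Tool], state: str, goal: str, seen: frozenset = frozenset()
) -> Optional[Tuple[float, List[str]]]:
    """Exhaustive search for the cheapest tool sequence from state to goal."""
    if state == goal:
        return 0.0, []
    best = None
    for name, (src, dst, cost) in tools.items():
        if src != state or dst in seen:
            continue  # tool not applicable here, or it would revisit a state
        rest = plan(tools, dst, goal, seen | {state})
        if rest is not None and (best is None or cost + rest[0] < best[0]):
            best = (cost + rest[0], [name] + rest[1])
    return best


def apply_event(tools: Dict[str, Tool], event: dict) -> Dict[str, Tool]:
    """Return a new tool table reflecting one disruption (illustrative event kinds)."""
    updated = dict(tools)
    if event["kind"] == "tool_failure":
        updated.pop(event["tool"], None)  # the tool is no longer callable
    elif event["kind"] == "cost_change":
        src, dst, _ = updated[event["tool"]]
        updated[event["tool"]] = (src, dst, event["new_cost"])
    return updated


tools = {
    "search_flights": ("query", "flights", 3.0),
    "filter_flights": ("flights", "shortlist", 2.0),
    "book_flight": ("shortlist", "booked", 4.0),
    "search_and_filter": ("query", "shortlist", 6.0),  # composite tool
}

initial = plan(tools, "query", "booked")
after_failure = plan(
    apply_event(tools, {"kind": "tool_failure", "tool": "filter_flights"}),
    "query", "booked",
)
after_cost_change = plan(
    apply_event(tools, {"kind": "cost_change", "tool": "filter_flights", "new_cost": 5.0}),
    "query", "booked",
)
print(initial, after_failure, after_cost_change)
```

In this toy setup both disruptions push the cheapest plan from the atomic sequence to the composite tool. An agent that keeps following its original plan would either fail outright (tool failure) or overpay (cost change), which is the kind of behavior a dynamic benchmark is meant to expose.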

🔎 Similar Papers

Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning