🤖 AI Summary
Existing benchmarks inadequately evaluate embodied agents’ long-horizon planning capabilities under dense, real-world constraints. Method: We introduce DeliveryBench, the first city-scale, profit-oriented embodied agent benchmark, grounded in real-world food delivery operations. It simulates agents optimizing net profit while managing delivery deadlines, battery consumption, traffic costs, and multi-agent coordination within a procedurally generated 3D urban environment. Our approach integrates vision-language model (VLM)-based agent architectures, dynamic resource modeling, and multi-agent simulation. Contribution/Results: Experiments across nine cities reveal that state-of-the-art VLM agents suffer from myopic decision-making, violations of commonsense reasoning, and frequent constraint breaches. Distinct behavioral preferences emerge across models (e.g., GPT-5 exhibits aggressive scheduling; Claude adopts conservative strategies). All models significantly underperform human delivery professionals. This work pioneers the systematic integration of occupation-level objectives and high-dimensional real-world constraints into embodied AI evaluation.
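To make the optimization target concrete, here is a minimal sketch of the kind of per-order accounting the summary describes: net profit as order payout minus travel cost minus a late penalty, summed over a shift. All class names, fields, and numbers are illustrative assumptions, not DeliveryBench's actual API.

```python
from dataclasses import dataclass

@dataclass
class Order:
    payout: float        # payment for completing the delivery
    deadline: float      # latest on-time delivery, in minutes
    late_penalty: float  # deducted if delivered past the deadline

def net_profit(order: Order, delivery_time: float, travel_cost: float) -> float:
    """Net profit for one order under deadline and transportation-cost constraints."""
    penalty = order.late_penalty if delivery_time > order.deadline else 0.0
    return order.payout - travel_cost - penalty

# A courier's long-horizon objective is the sum over all completed orders.
completed = [
    (Order(payout=8.0, deadline=30, late_penalty=3.0), 25, 1.5),  # on time
    (Order(payout=6.0, deadline=20, late_penalty=2.0), 24, 1.0),  # late
]
total = sum(net_profit(o, t, c) for o, t, c in completed)
print(total)  # 6.5 + 3.0 = 9.5
```

The point of the sketch is that a greedy, per-order view misses the long-horizon trade-offs (battery, routing, order batching) the benchmark evaluates.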
📝 Abstract
LLMs and VLMs are increasingly deployed as embodied agents, yet existing benchmarks largely revolve around simple short-term tasks and struggle to capture the rich, realistic constraints that shape real-world decision making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark grounded in the real-world profession of food delivery. Food couriers naturally operate under long-horizon objectives (maximizing net profit over hours) while managing diverse constraints, e.g., delivery deadlines, transportation expenses, vehicle battery, and necessary interactions with other couriers and customers. DeliveryBench instantiates this setting in procedurally generated 3D cities with diverse road networks, buildings, functional locations, transportation modes, and realistic resource dynamics, enabling systematic evaluation of constraint-aware, long-horizon planning. We benchmark a range of VLM-based agents across nine cities and compare them with human players. Our results reveal a substantial performance gap relative to humans and show that these agents are short-sighted and frequently break basic commonsense constraints. Additionally, we observe distinct personalities across models (e.g., adventurous GPT-5 vs. conservative Claude), highlighting both the brittleness and the diversity of current VLM-based embodied agents in realistic, constraint-dense environments. Our code, data, and benchmark are available at https://deliverybench.github.io.