🤖 AI Summary
Existing evaluation benchmarks for task-oriented LLM agents largely overlook cultural and linguistic diversity, relying predominantly on monolingual or machine-translated test suites that fail to reflect genuine cross-lingual function-calling capability. Method: We introduce Ticket-Bench, a multilingual, regionally grounded agent benchmark for soccer ticket purchasing, covering Portuguese, English, Spanish, German, Italian, and French, with deep integration of localized teams, cities, and user profiles. By pairing regionalized scenarios with multilingual evaluation, the benchmark systematically surfaces structural biases in cross-lingual function invocation. Contribution/Results: Experiments across commercial and open-source LLMs (e.g., GPT-5, Qwen3-235B) show that stronger reasoning models achieve higher overall performance yet exhibit substantial inter-lingual disparities, highlighting critical bottlenecks in multilingual deployment. This work establishes a reproducible, culturally aware evaluation paradigm for intelligent agents.
📝 Abstract
Large language models (LLMs) are increasingly deployed as task-oriented agents, where success depends on their ability to generate accurate function calls under realistic, multilingual conditions. However, existing agent evaluations largely overlook cultural and linguistic diversity, often relying on monolingual or naively translated benchmarks. We introduce Ticket-Bench, a benchmark for multilingual agent evaluation in task-oriented scenarios. Ticket-Bench simulates the domain of soccer ticket purchases across six major languages: Portuguese, English, Spanish, German, Italian, and French, using localized teams, cities, and user profiles to provide a higher level of realism. We evaluate a wide range of commercial and open-source LLMs, measuring function-calling accuracy and consistency across languages. Results show that reasoning-oriented models (e.g., GPT-5, Qwen3-235B) dominate performance but still exhibit notable cross-lingual disparities. These findings underscore the need for culturally aware, multilingual benchmarks to guide the development of robust LLM agents.
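To make the evaluation setup concrete, the sketch below illustrates how function-calling accuracy might be scored across languages in a Ticket-Bench-style harness. This is a minimal illustration, not the paper's actual implementation: the `buy_ticket` function, its argument schema, and the exact-match scoring rule are all assumptions introduced here for clarity.

```python
# Minimal sketch of cross-lingual function-call scoring.
# The tool name, argument schema, and exact-match criterion are
# illustrative assumptions, not Ticket-Bench's actual harness.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


def calls_match(predicted: ToolCall, gold: ToolCall) -> bool:
    """Exact-match scoring: same function name and identical arguments."""
    return predicted.name == gold.name and predicted.arguments == gold.arguments


# The same purchase intent, expressed in different languages, should
# resolve to one canonical call (localized entity names preserved).
gold = ToolCall("buy_ticket", {"team": "Flamengo", "city": "Rio de Janeiro", "quantity": 2})

predicted = {
    "pt": ToolCall("buy_ticket", {"team": "Flamengo", "city": "Rio de Janeiro", "quantity": 2}),
    "en": ToolCall("buy_ticket", {"team": "Flamengo", "city": "Rio", "quantity": 2}),
}

for lang, call in predicted.items():
    print(lang, "correct" if calls_match(call, gold) else "incorrect")
# A cross-lingual disparity appears as per-language accuracy diverging
# even though the underlying task is identical.
```

Aggregating this per-language accuracy over localized test cases is one way to quantify the inter-lingual disparities the paper reports.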