🤖 AI Summary
Existing evaluation benchmarks for task-oriented LLM agents largely overlook cultural and linguistic diversity, relying predominantly on monolingual or machine-translated test suites that fail to reflect genuine cross-lingual function-calling capability. Method: We introduce Ticket-Bench, a multilingual, regionally grounded agent benchmark for soccer ticket purchasing, covering Portuguese, English, Spanish, German, Italian, and French, with deep integration of localized teams, cities, and user profiles. By pairing regionalized scenarios with multilingual evaluation, the benchmark systematically surfaces structural biases in cross-lingual function invocation. Contribution/Results: Experiments across commercial and open-source LLMs (e.g., GPT-5, Qwen3-235B) show that stronger reasoning models achieve higher overall performance yet exhibit substantial inter-lingual disparities, highlighting critical bottlenecks in multilingual deployment. This work establishes a reproducible, culturally aware evaluation paradigm for intelligent agents.
📝 Abstract
Large language models (LLMs) are increasingly deployed as task-oriented agents, where success depends on their ability to generate accurate function calls under realistic, multilingual conditions. However, existing agent evaluations largely overlook cultural and linguistic diversity, often relying on monolingual or naively translated benchmarks. We introduce Ticket-Bench, a benchmark for multilingual agent evaluation in task-oriented scenarios. Ticket-Bench simulates the domain of soccer ticket purchases across six major languages: Portuguese, English, Spanish, German, Italian, and French, using localized teams, cities, and user profiles to provide a higher level of realism. We evaluate a wide range of commercial and open-source LLMs, measuring function-calling accuracy and consistency across languages. Results show that reasoning-oriented models (e.g., GPT-5, Qwen3-235B) dominate performance but still exhibit notable cross-lingual disparities. These findings underscore the need for culturally aware, multilingual benchmarks to guide the development of robust LLM agents.
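To make the evaluation setup concrete, the sketch below illustrates how function-calling accuracy might be scored across languages in a Ticket-Bench-style harness. This is a minimal illustration, not the paper's actual implementation: the `buy_ticket` function, its argument schema, and the exact-match scoring rule are all assumptions introduced here for clarity.

```python
# Minimal sketch of cross-lingual function-call scoring.
# The tool name, argument schema, and exact-match criterion are
# illustrative assumptions, not Ticket-Bench's actual harness.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


def calls_match(predicted: ToolCall, gold: ToolCall) -> bool:
    """Exact-match scoring: same function name and identical arguments."""
    return predicted.name == gold.name and predicted.arguments == gold.arguments


# The same purchase intent, expressed in different languages, should
# resolve to one canonical call (localized entity names preserved).
gold = ToolCall("buy_ticket", {"team": "Flamengo", "city": "Rio de Janeiro", "quantity": 2})

predicted = {
    "pt": ToolCall("buy_ticket", {"team": "Flamengo", "city": "Rio de Janeiro", "quantity": 2}),
    "en": ToolCall("buy_ticket", {"team": "Flamengo", "city": "Rio", "quantity": 2}),
}

for lang, call in predicted.items():
    print(lang, "correct" if calls_match(call, gold) else "incorrect")
# A cross-lingual disparity appears as per-language accuracy diverging
# even though the underlying task is identical.
```

Aggregating this per-language accuracy over localized test cases is one way to quantify the inter-lingual disparities the paper reports.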