🤖 AI Summary
Current evaluation benchmarks for task-oriented spoken language assistants struggle to capture the linguistic diversity and complexity of real-world user requests. Beverage ordering is a clear case: it combines hard entity recognition, customizable options, and spontaneous speech phenomena. To address this gap, this work introduces and releases StarDrinks, a bilingual (English and Korean) spoken beverage-ordering test set that combines raw audio, human-transcribed utterances, and fine-grained slot annotations. The dataset supports evaluation at three levels: automatic speech recognition (ASR, speech to transcription), natural language understanding (NLU, transcription to slots), and spoken language understanding (SLU, speech to slots). By reflecting the linguistic variation of authentic interactions, StarDrinks provides a realistic benchmark for measuring the robustness and generalization of spoken language understanding systems in a complex task-oriented domain.
📝 Abstract
LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.
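
To make the three evaluation levels concrete, here is a minimal sketch of how one test item might be scored. It is an illustration under stated assumptions, not the dataset's actual schema or official scoring script: the record fields (`audio_path`, `transcription`, `slots`), the slot names, and the choice of word error rate for ASR and exact-match slot F1 for NLU/SLU are all hypothetical, since the abstract does not specify them.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def slot_f1(gold: dict, pred: dict) -> float:
    """F1 over exact (slot name, value) matches."""
    gold_items, pred_items = Counter(gold.items()), Counter(pred.items())
    true_pos = sum((gold_items & pred_items).values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred_items.values())
    recall = true_pos / sum(gold_items.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical test item; field and slot names are assumptions for illustration.
item = {
    "audio_path": "audio/en_0001.wav",
    "transcription": "can i get a uh large iced latte with oat milk",
    "slots": {"size": "large", "temperature": "iced",
              "drink": "latte", "milk": "oat milk"},
}

# ASR (speech to transcription): compare a system transcript with the reference.
asr_output = "can i get a large iced latte with oat milk"
asr_wer = word_error_rate(item["transcription"], asr_output)

# NLU (transcription to slots) and SLU (speech to slots) share the slot metric;
# they differ only in whether the system reads the transcript or the audio.
predicted_slots = {"size": "large", "temperature": "iced",
                   "drink": "latte", "milk": "soy milk"}
slot_score = slot_f1(item["slots"], predicted_slots)
```

Scoring the same item at all three levels helps separate recognition errors from understanding errors: a system with low ASR error but poor slot F1 is failing at interpretation rather than transcription.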