Do What I Say: A Spoken Prompt Dataset for Instruction-Following

πŸ“… 2026-03-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses a limitation in how current speech large language models (SLLMs) are evaluated: benchmarks rely predominantly on textual prompts, which fail to reflect performance under real-world spoken interaction. To bridge this gap, the authors introduce DOWIS, the first systematically constructed multilingual, multitask dataset pairing spoken and written prompts across nine task categories, eleven languages, and five spoken styles. Authentic spoken instructions were collected via human recordings and integrated with existing benchmark tasks, enabling comprehensive evaluation of SLLMs along four dimensions: modality, language, style, and task. Experimental results reveal that textual prompts generally outperform spoken ones, particularly in low-resource and cross-lingual settings; spoken prompts significantly narrow the performance gap only in speech-output tasks.
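
To make the pairing idea concrete, below is a minimal sketch of what a single DOWIS-style entry might look like. The schema, field names, and example values are hypothetical; the paper does not publish a data format here.

```python
# A minimal sketch, assuming a flat per-entry schema; the field names and
# example values below are hypothetical, not taken from the paper.
from dataclasses import dataclass

@dataclass
class SpokenPromptEntry:
    task: str         # one of the nine task categories
    language: str     # one of the eleven languages, e.g. "en"
    style: str        # one of the five spoken styles
    text_prompt: str  # written form of the instruction
    audio_path: str   # human-recorded spoken form of the same instruction

# The same instruction in both modalities, ready to be paired with an
# item from an existing benchmark.
entry = SpokenPromptEntry(
    task="summarization",
    language="en",
    style="formal",
    text_prompt="Summarize the following passage in one sentence.",
    audio_path="dowis/en/summarization/formal_03.wav",
)
```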

πŸ“ Abstract
Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact through speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken-instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
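
The four-way analysis over modality, language, style, and task suggests an evaluation loop along the lines of the sketch below, which reuses `SpokenPromptEntry` from the sketch above. `model.generate` and `score` are placeholder interfaces, not the authors' harness or any real SLLM API.

```python
# A hedged sketch of the modality comparison: run each benchmark item once
# with the text prompt and once with the paired spoken prompt, then
# aggregate score gaps per (task, language, style) cell.
from collections import defaultdict

def modality_gap(model, benchmark_items, prompt_entries, score):
    """Mean (text - spoken) score gap per (task, language, style) cell."""
    gaps = defaultdict(list)
    for item in benchmark_items:
        for p in prompt_entries:
            if p.task != item.task:
                continue  # only pair prompts with items of the same task
            text_out = model.generate(prompt_text=p.text_prompt, input=item.input)
            spoken_out = model.generate(prompt_audio=p.audio_path, input=item.input)
            gap = score(text_out, item.reference) - score(spoken_out, item.reference)
            gaps[(p.task, p.language, p.style)].append(gap)
    # A positive mean gap in a cell matches the paper's finding that text
    # prompts tend to outperform spoken ones there.
    return {cell: sum(v) / len(v) for cell, v in gaps.items()}
```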
Problem

Research questions and friction points this paper is trying to address.

spoken prompts
speech large language models
instruction-following
evaluation benchmark
multilingual
Innovation

Methods, ideas, or system contributions that make the work stand out.

spoken prompt dataset
instruction-following
speech large language models
multilingual evaluation
prompt modality