🤖 AI Summary
This work addresses a central bottleneck in AudioLLM development: the scarcity of diverse, character-consistent, and instruction-aligned speech-text data, particularly with respect to dialect coverage and speaker identity preservation. To overcome it, the authors propose a controllable generation framework that integrates World Values Survey–based persona construction, fine-grained dialogue scenario classification, and reference-audio-conditioned speech synthesis. Leveraging large language models, the framework generates multi-turn dialogues with consistent character traits and synthesizes speech conditioned on reference utterances to retain speaker characteristics and dialectal diversity. The project introduces MENASpeechBank, comprising 18,000 real utterances from 124 speakers across the Middle East and North Africa, alongside 417,000 high-quality synthetic dialogues spanning English, Modern Standard Arabic, and regional dialects. Evaluations of both the synthetic and human-recorded conversations support the data's quality, and all resources will be publicly released to advance community research.
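The multi-turn role-play generation described above can be sketched as a simple two-role loop: one LLM call produces the next user turn in character as the persona, and a second call produces the assistant's reply. This is a minimal illustration only; the `chat` interface, prompt wording, and stub responses below are assumptions for exposition, not the paper's actual implementation.

```python
def generate_roleplay_dialogue(persona: str, scenario: str, chat, n_turns: int = 3):
    """Generate a multi-turn dialogue where the 'user' speaks in character
    as the persona and the 'assistant' acts as a helpful agent.

    `chat(system_prompt, history)` is an assumed LLM interface that returns
    the next turn as a string, conditioned on the conversation so far.
    """
    user_system = (
        f"You are role-playing this persona: {persona}. "
        f"Stay in character throughout. Current scenario: {scenario}."
    )
    assistant_system = "You are a helpful assistant. Respond to the user naturally."

    dialogue = []
    for _ in range(n_turns):
        # The persona-conditioned LLM produces the next user turn.
        user_turn = chat(user_system, dialogue)
        dialogue.append({"role": "user", "content": user_turn})
        # The assistant LLM replies to the full conversation so far.
        assistant_turn = chat(assistant_system, dialogue)
        dialogue.append({"role": "assistant", "content": assistant_turn})
    return dialogue


# Stubbed LLM call so the sketch runs end-to-end without an API.
def fake_chat(system_prompt, history):
    return f"turn {len(history)}"


convo = generate_roleplay_dialogue(
    "a retired teacher from Casablanca who values community",  # invented example
    "asking for advice on organizing a neighborhood literacy program",
    chat=fake_chat,
    n_turns=2,
)
for msg in convo:
    print(msg["role"], "->", msg["content"])
```

In the real pipeline the user turns would subsequently be synthesized into speech conditioned on the persona's reference audio; only the text-generation loop is shown here.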
📝 Abstract
Audio large language models (AudioLLMs) enable instruction-following over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. This bottleneck is especially acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417K role-play conversations with an LLM where the user speaks as the persona and the assistant behaves as a helpful agent, and (v) synthesizes the user turns by conditioning on reference speaker audio to preserve speaker identity and diversity. We evaluate both synthetic and human-recorded conversations and provide detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.
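Step (iii), matching personas to scenarios via semantic similarity, can be sketched as follows. This is a minimal, self-contained illustration: the bag-of-words vector is a toy stand-in for whatever sentence encoder the authors actually use, and the function names and example texts are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a sentence embedding: a bag-of-words count vector.
    # The paper's pipeline would use a real semantic encoder here.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match_personas_to_scenarios(personas, scenarios, top_k=2):
    """Assign each persona its top_k most semantically similar scenarios."""
    scenario_vecs = [(s, embed(s)) for s in scenarios]
    matches = {}
    for p in personas:
        pv = embed(p)
        ranked = sorted(scenario_vecs, key=lambda sv: cosine(pv, sv[1]), reverse=True)
        matches[p] = [s for s, _ in ranked[:top_k]]
    return matches


personas = ["a young engineer who values tradition and family"]
scenarios = [
    "planning a family gathering for a religious holiday",
    "debugging a distributed system outage",
    "discussing career choices with a mentor",
]
matches = match_personas_to_scenarios(personas, scenarios, top_k=1)
print(matches)
```

At the paper's scale (personas against roughly 5K scenarios), the same ranking would typically be done with dense embeddings and a vectorized similarity computation rather than this pairwise loop.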