MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations in current AudioLLM development stemming from a scarcity of diverse, character-consistent, and instruction-aligned speech-text data, particularly regarding dialect coverage and speaker identity preservation. To overcome this, the authors propose a controllable generation framework that integrates World Values Survey–based persona construction, fine-grained dialogue scenario classification, and reference-audio-conditioned speech synthesis. Leveraging large language models, the framework generates multi-turn dialogues with consistent character traits and synthesizes speech conditioned on reference utterances to retain speaker characteristics and dialectal diversity. The project introduces MENASpeechBank, comprising 18,000 real utterances from 124 speakers across the Middle East and North Africa, alongside 417,000 high-quality synthetic dialogues spanning English, Modern Standard Arabic, and regional dialects. Evaluations confirm the data’s effectiveness, and all resources will be publicly released to advance community research.

📝 Abstract
Audio large language models (AudioLLMs) enable instruction-following over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. This bottleneck is especially acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417K role-play conversations with an LLM where the user speaks as the persona and the assistant behaves as a helpful agent, and (v) synthesizes the user turns by conditioning on reference speaker audio to preserve speaker identity and diversity. We evaluate both synthetic and human-recorded conversations and provide detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.
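The abstract's step (iii), matching personas to scenarios via semantic similarity, can be sketched with plain cosine similarity over text embeddings. The code below is a minimal illustration, not the paper's implementation: the toy 3-dimensional vectors stand in for real sentence-encoder outputs, and all names (`match_personas_to_scenarios`, the persona/scenario IDs) are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_personas_to_scenarios(persona_embs: dict, scenario_embs: dict, top_k: int = 3) -> dict:
    """For each persona embedding, return the IDs of the top-k most similar scenarios."""
    matches = {}
    for p_id, p_vec in persona_embs.items():
        scores = [(s_id, cosine_similarity(p_vec, s_vec))
                  for s_id, s_vec in scenario_embs.items()]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        matches[p_id] = [s_id for s_id, _ in scores[:top_k]]
    return matches

# Toy 3-d embeddings standing in for real sentence-encoder outputs.
personas = {"p1": np.array([1.0, 0.0, 0.0]), "p2": np.array([0.0, 1.0, 0.0])}
scenarios = {
    "travel":  np.array([0.9, 0.1, 0.0]),
    "health":  np.array([0.1, 0.95, 0.0]),
    "finance": np.array([0.5, 0.5, 0.1]),
}
print(match_personas_to_scenarios(personas, scenarios, top_k=2))
```

In a real pipeline the embeddings would come from a sentence encoder and the top-k matches would seed the role-play conversation generation of step (iv).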
Problem

Research questions and friction points this paper addresses: AudioLLMs, speech-text data, persona-grounded interactions, dialectal coverage, multi-speaker recordings.
Innovation

Methods, ideas, or system contributions that make the work stand out: persona-conditioned synthesis, multi-turn conversations, AudioLLMs, speech-text data generation, speaker identity preservation.
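The persona construction highlighted above can be pictured as a small structured record enriched with World Values Survey-inspired trait fields. This is a hedged sketch of one plausible schema; the attribute names below are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """Illustrative persona profile. The trait fields are hypothetical,
    loosely modeled on World Values Survey-style value dimensions."""
    name: str
    country: str
    dialect: str        # e.g. "MSA", "Gulf Arabic", "Egyptian Arabic"
    age_group: str
    values: dict = field(default_factory=dict)  # survey-inspired traits

# Hypothetical example persona for a MENA-dialect speaker.
p = Persona(
    name="Layla",
    country="Morocco",
    dialect="Darija",
    age_group="25-34",
    values={"tradition": "high", "openness_to_change": "medium"},
)
print(p.dialect, p.values["tradition"])
```

Such a record would then be embedded and matched to conversational scenarios, and the user turns it produces would be voiced by conditioning the synthesizer on that speaker's reference audio.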
🔎 Similar Papers
2024-09-02 · 2024 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) · Citations: 2
Zien Sheikh Ali (Qatar Computing Research Institute, Qatar)
Hunzalah Hassan Bhatti (Qatar Computing Research Institute, Qatar)
Rabindra Nath Nandi (Lead NLP Engineer, Hishab; ex-Principal Software Engineer, AI, BJIT)
Shammur Absar Chowdhury (Qatar Computing Research Institute)
Firoj Alam (Qatar Computing Research Institute, Qatar)