Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech large language models struggle to accurately understand and generate expressive speech because they underuse paralinguistic cues such as prosody and emotion, a limitation rooted in data scarcity, annotation difficulty, and reliance on lexical shortcuts. To address these issues, this work proposes PALLM (Paralinguistic-Aware Large Language Model), which explicitly models paralinguistic reasoning through a multi-task reinforcement learning framework augmented with chain-of-thought prompting. PALLM uses a two-stage joint training strategy to simultaneously optimize audio-based emotion classification and paralinguistic-aware response generation. Evaluated on the Expresso, IEMOCAP, and RAVDESS datasets, PALLM outperforms supervised baselines and strong contemporary models, including Gemini-2.5-Pro and GPT-4o-audio, by 8-12% on paralinguistic understanding tasks, mitigating data bottlenecks and improving emotional expressiveness.

📝 Abstract
Speech large language models (LLMs) receive paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for understanding intent. Leveraging these cues, however, faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistic understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
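The joint objective described above, optimizing audio emotion classification together with paralinguistics-aware response generation, can be sketched as a weighted multi-task reward. The reward definitions, marker-matching heuristic, and weights below are illustrative assumptions for exposition, not the paper's actual formulation:

```python
# Hypothetical sketch of a multi-task RL reward combining two tasks:
# (1) emotion classification from audio and (2) paralinguistics-aware
# response generation. All reward shapes and weights are assumptions.

def classification_reward(predicted_emotion: str, gold_emotion: str) -> float:
    """Reward 1.0 for a correct emotion label, 0.0 otherwise."""
    return 1.0 if predicted_emotion == gold_emotion else 0.0

def generation_reward(response: str, style_markers: set) -> float:
    """Fraction of expected paralinguistic style markers found in the response."""
    if not style_markers:
        return 0.0
    hits = sum(1 for marker in style_markers if marker in response.lower())
    return hits / len(style_markers)

def multitask_reward(predicted_emotion: str, gold_emotion: str,
                     response: str, style_markers: set,
                     w_cls: float = 0.5, w_gen: float = 0.5) -> float:
    """Weighted sum of the two task rewards (weights are illustrative)."""
    return (w_cls * classification_reward(predicted_emotion, gold_emotion)
            + w_gen * generation_reward(response, style_markers))

# A policy update (e.g. PPO or GRPO) would maximize this scalar per sample.
print(multitask_reward("sad", "sad",
                       "I'm sorry to hear that.", {"sorry", "hear"}))  # → 1.0
```

In a real pipeline the generation reward would come from a learned judge or human preference model rather than keyword matching; the point is that both task signals feed one scalar the RL optimizer maximizes.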
Problem

Research questions and friction points this paper is trying to address.

paralinguistic understanding
speech LLMs
data scarcity
annotation difficulty
lexical shortcuts
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-task reinforcement learning
paralinguistic understanding
speech LLMs
chain-of-thought prompting
emotion-aware generation