Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work investigates whether frozen large language models (LLMs) can implicitly perceive paralinguistic features—such as prosody and emotion—from speech without fine-tuning. To this end, we propose an end-to-end trainable speech encoder that maps raw audio into semantically and affectively enriched tokens, directly fed into a frozen LLM to generate empathetic responses. Our key contributions are twofold: (1) the first demonstration that implicit paralinguistic perception can be conferred to frozen LLMs without updating their weights; and (2) a semantic-affective alignment paradigm for speech token learning, jointly optimized via cross-modal alignment and multi-task objectives targeting emotion- and style-conditioned response generation. Experiments show substantial improvements in response quality and empathy when processing expressive speech cues, outperforming strong baselines across multiple objective and subjective metrics. These results validate that frozen LLMs can effectively leverage paralinguistic signals when augmented with appropriately designed, modality-aligned speech representations.

📝 Abstract
As speech becomes an increasingly common modality for interacting with large language models (LLMs), it is becoming desirable to develop systems where LLMs can take into account users' emotions or speaking styles when providing their responses. In this work, we study the potential of an LLM to understand these aspects of speech without fine-tuning its weights. To do this, we utilize an end-to-end system with a speech encoder; the encoder is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt where the speaker's emotion has also been specified. We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech and effectively convey it to the LLM, even when the LLM remains completely frozen. We also explore training on additional emotion and style-related response alignment tasks, finding that they further increase the amount of paralinguistic information explicitly captured in the speech tokens. Experiments demonstrate that our system is able to produce higher quality and more empathetic responses to expressive speech prompts compared to several baselines.
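The training setup described above — a trainable speech encoder whose token embeddings feed a frozen LLM, optimized so the LLM's response to expressive speech aligns with its response to a matching emotion-annotated text prompt — can be sketched as a toy alignment loop. This is a minimal illustration, not the paper's implementation: the tensor sizes, the random "audio" and "text" inputs, and the small linear stand-in for the frozen LLM are all assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D_AUDIO, D_MODEL, N_TOK, VOCAB, BATCH = 40, 64, 8, 100, 4  # illustrative sizes

# Trainable speech encoder: audio features -> a short sequence of token embeddings.
encoder = nn.Sequential(
    nn.Linear(D_AUDIO, D_MODEL), nn.GELU(),
    nn.Linear(D_MODEL, N_TOK * D_MODEL),
)

# Stand-in "frozen LLM": maps token embeddings to response logits; weights never update.
llm = nn.Linear(D_MODEL, VOCAB)
for p in llm.parameters():
    p.requires_grad_(False)

def llm_logits(tok_emb):
    # Mean-pool the token embeddings, then predict a response distribution.
    return llm(tok_emb.mean(dim=1))

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
audio = torch.randn(BATCH, D_AUDIO)            # dummy speech features
text_emb = torch.randn(BATCH, N_TOK, D_MODEL)  # dummy embeddings of the matching
                                               # emotion-annotated text prompt

for step in range(50):
    speech_tok = encoder(audio).view(BATCH, N_TOK, D_MODEL)
    with torch.no_grad():  # teacher signal: the frozen LLM's response to text
        target = F.softmax(llm_logits(text_emb), dim=-1)
    # Align the LLM's response to speech tokens with its response to the text prompt.
    loss = F.kl_div(F.log_softmax(llm_logits(speech_tok), dim=-1),
                    target, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

print(round(loss.item(), 4))
```

Note that gradients flow *through* the frozen LLM into the encoder even though the LLM's own weights have `requires_grad=False` — this is what lets the encoder learn LLM-compatible speech tokens without any fine-tuning of the LLM itself.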
Problem

Research questions and friction points this paper is trying to address.

Enable a frozen LLM to understand paralinguistic aspects of speech without fine-tuning
Align an LLM's responses to expressive speech prompts with its responses to emotion-annotated text prompts
Capture both semantic and paralinguistic information in speech token embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

First demonstration that a frozen LLM can perceive paralinguistic speech cues without weight updates
End-to-end speech encoder trained by aligning speech-token responses with text-prompt responses
Encoder tokens convey both semantic and paralinguistic information to the frozen LLM