🤖 AI Summary
Existing large audio-language models struggle to control emotion in generated speech reliably, often exhibiting emotional misalignment, linguistic distortion, or hallucination. This work proposes a training-free, inference-stage intervention that, for the first time, identifies emotion-sensitive neurons (ESNs) and leverages them for neuron-level emotion control. ESNs are selected by aggregating activations, and precise control emerges from the combination of selector design, mask sparsity, filtering strategy, and intervention intensity modulation. Evaluated on three models, including Qwen2.5-Omni-7B, the method generalizes well to unseen speakers and significantly outperforms baseline approaches in both automatic and human evaluations. The study establishes an interpretable, intervenable paradigm for emotion control in neural audio generation.
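To make the selection step concrete, here is a minimal, purely illustrative sketch (not the paper's exact selector): it assumes you have cached per-example hidden activations for emotional versus neutral generations (the `emotion_acts` and `neutral_acts` tensors are hypothetical), aggregates them per neuron, and keeps a sparse top-k mask of the most differentially active units, mirroring the roles of activation aggregation and mask sparsity described above.

```python
import torch

def select_esn_mask(emotion_acts: torch.Tensor,
                    neutral_acts: torch.Tensor,
                    sparsity: float = 0.01) -> torch.Tensor:
    """Sketch of ESN selection by contrastive activation aggregation.

    emotion_acts / neutral_acts: [num_examples, num_neurons] cached hidden
    activations for one layer; returns a boolean mask over that layer's neurons.
    """
    # Aggregate activations across examples for each condition.
    score = emotion_acts.mean(dim=0) - neutral_acts.mean(dim=0)
    # Keep only a sparse fraction of the most emotion-sensitive units.
    k = max(1, int(sparsity * score.numel()))
    top = torch.topk(score.abs(), k).indices
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[top] = True
    return mask
```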
📝 Abstract
Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.
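As a rough illustration of what a training-free, inference-stage intervention can look like in a PyTorch-based LALM, the sketch below scales only the masked ESN activations by an intensity factor through a forward hook. The `layer`, `mask`, and `alpha` names are assumptions for illustration; the paper's actual intervention mechanics and intensity schedule may differ.

```python
import torch

def add_esn_hook(layer: torch.nn.Module, mask: torch.Tensor, alpha: float = 2.0):
    """Register a hook that amplifies (or suppresses) ESN activations.

    Returns the hook handle so the intervention can be removed after generation.
    """
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        # Steer only the emotion-sensitive neurons; leave the rest untouched.
        hidden[..., mask] = hidden[..., mask] * alpha
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage: attach one hook per layer with its precomputed mask,
# run model.generate(...), then call handle.remove() on each hook afterwards.
```

The design choice to keep the edit entirely at inference time (hooks plus a scalar intensity) is what makes the approach training-free: no weights are updated, and the intervention can be switched off by removing the hooks.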