🤖 AI Summary
Existing large audio-language models struggle to control emotion in generated speech reliably, often exhibiting emotional misalignment, linguistic distortion, or hallucination. This work proposes a training-free, inference-stage intervention that, for the first time, identifies emotion-sensitive neurons (ESNs) and leverages them for neuron-level emotion control. ESNs are selected by aggregating activations, and precise control emerges from the combination of selector design, mask sparsity, filtering strategy, and intervention intensity modulation. Evaluated on three models, including Qwen2.5-Omni-7B, the method generalizes well to unseen speakers and significantly outperforms baseline approaches in both automatic and human evaluations. The study establishes an interpretable, intervenable paradigm for emotion control in neural audio generation.
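To make the selection step concrete, here is a minimal, purely illustrative sketch (not the paper's exact selector): it assumes you have cached per-example hidden activations for emotional versus neutral generations (the `emotion_acts` and `neutral_acts` tensors are hypothetical), aggregates them per neuron, and keeps a sparse top-k mask of the most differentially active units, mirroring the roles of activation aggregation and mask sparsity described above.

```python
import torch

def select_esn_mask(emotion_acts: torch.Tensor,
                    neutral_acts: torch.Tensor,
                    sparsity: float = 0.01) -> torch.Tensor:
    """Sketch of ESN selection by contrastive activation aggregation.

    emotion_acts / neutral_acts: [num_examples, num_neurons] cached hidden
    activations for one layer; returns a boolean mask over that layer's neurons.
    """
    # Aggregate activations across examples for each condition.
    score = emotion_acts.mean(dim=0) - neutral_acts.mean(dim=0)
    # Keep only a sparse fraction of the most emotion-sensitive units.
    k = max(1, int(sparsity * score.numel()))
    top = torch.topk(score.abs(), k).indices
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[top] = True
    return mask
```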
📝 Abstract
Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.
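As a rough illustration of what a training-free, inference-stage intervention can look like in a PyTorch-based LALM, the sketch below scales only the masked ESN activations by an intensity factor through a forward hook. The `layer`, `mask`, and `alpha` names are assumptions for illustration; the paper's actual intervention mechanics and intensity schedule may differ.

```python
import torch

def add_esn_hook(layer: torch.nn.Module, mask: torch.Tensor, alpha: float = 2.0):
    """Register a hook that amplifies (or suppresses) ESN activations.

    Returns the hook handle so the intervention can be removed after generation.
    """
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        # Steer only the emotion-sensitive neurons; leave the rest untouched.
        hidden[..., mask] = hidden[..., mask] * alpha
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage: attach one hook per layer with its precomputed mask,
# run model.generate(...), then call handle.remove() on each hook afterwards.
```

The design choice to keep the edit entirely at inference time (hooks plus a scalar intensity) is what makes the approach training-free: no weights are updated, and the intervention can be switched off by removing the hooks.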