ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations from Speech

📅 2026-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning disentangled paralinguistic style representations—such as emotion, age, and gender—from speech to jointly support both recognition and generation tasks. The authors propose ParaMETA, a novel framework that, for the first time, achieves disentangled embeddings of multiple paralinguistic attributes within a single model through a subspace projection mechanism. Integrated with a multi-task learning architecture, ParaMETA effectively mitigates task interference and negative transfer. The approach is compatible with both speech and text prompts and can be seamlessly incorporated into end-to-end text-to-speech (TTS) systems. Experimental results demonstrate that ParaMETA outperforms strong baselines across multiple paralinguistic classification tasks, generates more natural and expressive speech, and maintains a lightweight, efficient design with strong practical applicability.

📝 Abstract
Learning representative embeddings for different types of speaking styles, such as emotion, age, and gender, is critical for both recognition tasks (e.g., cognitive computing and human-computer interaction) and generative tasks (e.g., style-controllable speech generation). In this work, we introduce ParaMETA, a unified and flexible framework for learning and controlling speaking styles directly from speech. Unlike existing methods that rely on single-task models or cross-modal alignment, ParaMETA learns disentangled, task-specific embeddings by projecting speech into dedicated subspaces for each type of style. This design reduces inter-task interference, mitigates negative transfer, and allows a single model to handle multiple paralinguistic tasks such as emotion, gender, age, and language classification. Beyond recognition, ParaMETA enables fine-grained style control in Text-To-Speech (TTS) generative models. It supports both speech- and text-based prompting and allows users to modify one speaking style while preserving the others. Extensive experiments demonstrate that ParaMETA outperforms strong baselines in classification accuracy and generates more natural and expressive speech, while maintaining a lightweight and efficient model suitable for real-world applications.
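The core idea in the abstract — projecting a shared speech embedding into dedicated per-attribute subspaces, each with its own classifier head — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attribute names, dimensions, and linear projections are all assumptions for the sake of the example.

```python
import numpy as np

# Hypothetical sketch of the subspace-projection mechanism described above:
# a shared speech embedding is projected into a dedicated subspace per
# paralinguistic attribute, and each subspace feeds its own classifier head,
# limiting interference between tasks. All sizes/attributes are assumed.

rng = np.random.default_rng(0)

D_SHARED = 256       # assumed shared-encoder embedding size
SUBSPACE_DIM = 64    # assumed per-attribute subspace size
TASKS = {"emotion": 7, "gender": 2, "age": 4, "language": 5}  # assumed class counts

# One projection matrix per paralinguistic attribute (the "subspaces").
projections = {t: rng.standard_normal((D_SHARED, SUBSPACE_DIM)) / np.sqrt(D_SHARED)
               for t in TASKS}
# One linear classifier head per attribute, operating only on its subspace.
heads = {t: rng.standard_normal((SUBSPACE_DIM, n)) / np.sqrt(SUBSPACE_DIM)
         for t, n in TASKS.items()}

def disentangled_logits(shared_embedding):
    """Project shared embeddings into each task subspace and classify."""
    out = {}
    for task, P in projections.items():
        z_task = shared_embedding @ P        # task-specific subspace embedding
        out[task] = z_task @ heads[task]     # per-task logits
    return out

batch = rng.standard_normal((8, D_SHARED))   # stand-in for speech-encoder outputs
logits = disentangled_logits(batch)
print({t: v.shape for t, v in logits.items()})
```

In a multi-task training loop, each head's loss would be computed on its own subspace embedding, so gradients for one attribute flow mainly through that attribute's projection — the property the abstract credits with reducing inter-task interference.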
Problem

Research questions and friction points this paper is trying to address.

disentangled representations
paralinguistic styles
speech representation learning
style control
multitask learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled representation
paralinguistic style
speech embedding
multi-task learning
style-controllable TTS