🤖 AI Summary
Existing emotion-aware CLAP methods perform simplistic audio–text alignment and neglect the ordinal nature of emotions, such as their structured ordering in the valence–arousal (V-A) space, leading to suboptimal cross-modal alignment and limited affective understanding. To address this, we propose a supervised contrastive learning framework that jointly models dimensional emotion attributes (valence and arousal) and natural language prompts, explicitly incorporating emotion ordinality. Our key innovation is a Rank-N-Contrast objective, which captures fine-grained ordinal relationships among emotions within the V-A space. We extend the CLAP architecture with a dedicated cross-modal contrastive loss, improving both audio–text embedding alignment and ordinal consistency. Experiments show that our method consistently outperforms state-of-the-art emotion-CLAP models on a cross-modal affective retrieval task, advancing ordinal emotion-aware representation learning.
📝 Abstract
Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with corresponding text prompts. Consequently, this approach fails to capture the ordinal nature of emotions, hindering inter-emotion understanding and often resulting in a wide modality gap between the audio and text embeddings due to insufficient alignment. To address these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.
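To make the Rank-N-Contrast idea concrete, below is a minimal PyTorch sketch of the objective applied to valence-arousal labels. It shows only the core intra-batch formulation (the paper extends it with a cross-modal audio–text loss); the function name `rank_n_contrast_loss`, the temperature value, and the `(N, 2)` V-A label format are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rank_n_contrast_loss(embeddings: torch.Tensor,
                         va_labels: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Sketch of a Rank-N-Contrast objective over V-A labels (assumed setup).

    embeddings: (N, D) batch of embeddings from one modality.
    va_labels:  (N, 2) valence-arousal coordinates for each sample.
    """
    z = F.normalize(embeddings, dim=-1)
    sim = (z @ z.T) / temperature                    # (N, N) scaled cosine similarities
    label_dist = torch.cdist(va_labels, va_labels)   # (N, N) Euclidean V-A distances

    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    loss = z.new_zeros(())
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Negatives for pair (i, j): all samples k (k != i) that lie at
            # least as far from anchor i in V-A space as j does (j included).
            mask = (label_dist[i] >= label_dist[i, j]) & ~eye[i]
            denom = torch.logsumexp(sim[i][mask], dim=0)
            # -log softmax restricted to the ranked candidate set
            loss = loss + (denom - sim[i, j])
    return loss / (n * (n - 1))
```

The difference from a standard InfoNCE loss is the denominator: for each anchor-candidate pair, only samples at least as far from the anchor in V-A space compete in the softmax, which pushes embedding similarity to decrease monotonically with V-A distance and thereby encodes emotion ordinality.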