Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

📅 2026-04-17

🤖 AI Summary
This work addresses the alignment of long-sequence audio and text in low-resource settings, where severe dimensional imbalance between the modalities, disruption of modality-specific structure, and dominance of one modality's information hinder effective fusion. The authors propose HILBERT, a framework built on frozen pretrained speech and language encoders. HILBERT uses cross-modal attention and self-attentive pooling to produce modality-specific and joint embeddings, and adds three training signals: a reciprocal dual contrastive objective that aligns audio and text each to the joint embedding, a Centered Kernel Alignment (CKA) regularizer that preserves structural consistency between each modality and the joint embedding, and a mutual information regularizer that balances the contributions of the two modalities. A Mixture-of-Experts (MoE) classifier over the concatenated audio, text, and joint representations handles downstream prediction. Evaluated across multiple audio-text backbone combinations, HILBERT learns semantically meaningful long-sequence representations and performs best on highly imbalanced multi-class tasks.
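
To make the structure-preserving regularizer concrete, here is a minimal NumPy sketch of a CKA-based penalty. It assumes the linear-kernel form of CKA; the summary does not say which kernel HILBERT uses, and the names `linear_cka` and `cka_loss` are illustrative rather than taken from the paper.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two sets of embeddings for the same n documents.

    x: (n, d_x) array, y: (n, d_y) array. The feature dimensions may differ,
    which is why CKA suits comparisons between embeddings of unequal width.
    """
    # Center each feature column so the similarity ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def cka_loss(modality, joint):
    """Hypothetical penalty: 0 when the two embeddings share the same structure."""
    return 1.0 - linear_cka(modality, joint)
```

Because linear CKA is invariant to isotropic scaling and orthogonal transforms, a penalty of this shape would reward the joint embedding for keeping the same pairwise geometry as each modality without forcing the vectors themselves to match.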

📝 Abstract
We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, HILBERT employs a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations to accommodate heterogeneous label regimes. Extensive evaluation across multiple audio-text backbone combinations demonstrates that HILBERT learns semantically meaningful long-sequence representations and achieves superior performance on highly imbalanced multi-class settings.
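
The reciprocal dual contrastive objective described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a symmetric InfoNCE loss, assumes the audio, text, and joint embeddings have already been projected to a common dimension, and picks an arbitrary temperature of 0.1. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(anchor, target, temperature=0.1):
    """Symmetric InfoNCE: row i of anchor and row i of target form the positive pair."""
    logits = l2_normalize(anchor) @ l2_normalize(target).T / temperature

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def dual_contrastive_loss(audio, text, joint, temperature=0.1):
    """Align audio-to-joint and text-to-joint; audio and text are never contrasted directly."""
    return info_nce(audio, joint, temperature) + info_nce(text, joint, temperature)
```

In this sketch the joint embedding acts as the shared anchor for both terms, so each modality is pulled toward the fused representation of its own document and pushed away from the fused representations of other documents in the batch.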

Problem

Research questions and friction points this paper is trying to address.

multimodal representation
audio-text alignment
dimensional imbalance
long-sequence modeling
low-resource learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual contrastive alignment
structure-preserving regularization
information-balanced regularization
cross-attentive multimodal fusion
long-sequence embedding