DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal sentence representation methods perform well on coarse-grained cross-modal alignment but suffer from two key limitations: cross-modal misalignment bias and intra-modal semantic divergence, both of which hinder fine-grained semantic modeling. To address these issues, we propose DALR, a dual-level alignment learning framework. At the cross-modal level, a consistency learning module softens negative samples and leverages semantic similarity from an auxiliary task to achieve fine-grained alignment; at the intra-modal level, ranking distillation is integrated with global alignment learning to capture ranking structure among sentences that goes beyond binary positive-negative labels. Notably, the framework is the first to jointly model fine-grained cross-modal alignment and global intra-modal structural consistency. Extensive experiments on semantic textual similarity (STS) benchmarks and downstream transfer tasks demonstrate consistent improvements over state-of-the-art methods, validating its effectiveness in capturing complex sentence-level relationships.

📝 Abstract
Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.
Problem

Research questions and friction points this paper is trying to address.

Addresses cross-modal misalignment bias in multimodal learning
Mitigates intra-modal semantic divergence in sentence representations
Enhances fine-grained alignment and ranking structure modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-level alignment for multimodal representation
Consistency learning with softened negative samples
Ranking distillation for intra-modal alignment
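The ranking-distillation contribution above can be sketched as well. As a hedged illustration (the paper's objective is not given on this page), a common way to distill a teacher's similarity ranking is row-wise KL divergence between temperature-scaled softmax distributions over pairwise sentence similarities; the function and parameter names below are assumptions.

```python
import torch
import torch.nn.functional as F

def ranking_distillation_loss(student_sim, teacher_sim, tau_s=0.05, tau_t=0.05):
    """Illustrative ranking distillation: the student's similarity
    distribution over the batch matches the teacher's (not necessarily
    DALR's exact loss).

    student_sim, teacher_sim: (N, N) pairwise sentence similarities
    """
    p_teacher = F.softmax(teacher_sim / tau_t, dim=-1)
    log_p_student = F.log_softmax(student_sim / tau_s, dim=-1)
    # KL(teacher || student), averaged over batch rows.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

The loss is zero when the student reproduces the teacher's soft ranking exactly, and it penalizes rank inversions more heavily at lower temperatures.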
Kang He
Purdue University
Large Language Models · Reasoning · Agentic AI
Yuzhe Ding
Wuhan University
Haining Wang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
Fei Li
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
Chong Teng
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
Donghong Ji
Wuhan University
Artificial Intelligence · Natural Language Processing