🤖 AI Summary
This paper addresses the challenge of multilingual, multimodal speech–text semantic alignment. To this end, the authors propose SENSE, a teacher–student distillation framework that performs utterance-level alignment between a self-supervised speech encoder (XLSR-based) and a language-agnostic continuous text encoder, updating the SAMU-XLSR approach with a stronger teacher text model and a better initialization of the speech encoder. Notably, the authors are the first to fully open-source both training and inference code, integrated into the SpeechBrain toolkit. Experiments demonstrate that SENSE achieves highly competitive performance on multilingual speech–text retrieval and cross-lingual semantic similarity tasks, supporting its effectiveness and generalization for cross-lingual, cross-modal semantic understanding.
📝 Abstract
This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI's SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.
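The utterance-level teacher-student alignment described above can be illustrated with a minimal sketch: the student speech encoder's frame-level outputs are pooled into a single utterance vector, which is trained to match the frozen teacher text embedding under a cosine objective. All names and dimensions below are illustrative assumptions, not the authors' implementation.

```python
import math

def mean_pool(frames):
    """Pool frame-level speech features (a T x D list of lists)
    into a single utterance-level vector of dimension D."""
    T, D = len(frames), len(frames[0])
    return [sum(f[d] for f in frames) / T for d in range(D)]

def cosine_alignment_loss(speech_frames, text_embedding):
    """1 - cosine similarity between the pooled student speech embedding
    and the (frozen) teacher text embedding; minimized during distillation."""
    s = mean_pool(speech_frames)
    dot = sum(a * b for a, b in zip(s, text_embedding))
    norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in text_embedding))
    return 1.0 - dot / (norm + 1e-8)  # small epsilon guards against zero vectors

# Toy example: two 2-dim speech frames whose mean already points
# in the same direction as the teacher text embedding.
frames = [[1.0, 0.0], [0.0, 1.0]]
teacher = [0.5, 0.5]
print(round(cosine_alignment_loss(frames, teacher), 6))  # → 0.0 (already aligned)
```

In practice the teacher (the text encoder) is kept frozen and only the speech encoder is updated, so gradients from this loss flow solely into the student.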