🤖 AI Summary
Current SLU research lacks standardized, automated metrics for evaluating the quality of cross-modal latent-space alignment between speech and text, hindering interpretability and optimization of multimodal large language models (MLLMs). To address this, we propose ALAS (Automatic Latent Alignment Score), a fully automatic, annotation-free metric of implicit alignment. ALAS computes the cosine similarity between speech and text embeddings across Transformer layers, incorporating inter-layer normalization and task-adaptive aggregation to enable cross-layer and cross-task consistency analysis. Evaluated on spoken question answering and speech emotion recognition, ALAS accurately captures the depth-wise evolution of alignment, demonstrating strong discriminability, generalizability, and reproducibility. It provides a reliable, quantitative tool for analyzing alignment mechanisms and guiding architectural design in MLLMs.
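The layer-wise computation described above can be sketched as follows. The source does not specify the pooling strategy or the exact form of the inter-layer normalization, so the names `alas_layerwise` and `normalize_across_layers`, the mean pooling over the sequence, and the min-max normalization are all illustrative assumptions, not the paper's documented procedure.

```python
import numpy as np

def alas_layerwise(speech_embs, text_embs):
    """Cosine similarity between mean-pooled speech and text embeddings, per layer.

    speech_embs / text_embs: one (seq_len, dim) array per Transformer layer.
    Mean pooling over the sequence dimension is an illustrative choice.
    """
    scores = []
    for s, t in zip(speech_embs, text_embs):
        s_vec, t_vec = s.mean(axis=0), t.mean(axis=0)
        denom = np.linalg.norm(s_vec) * np.linalg.norm(t_vec) + 1e-8
        scores.append(float(np.dot(s_vec, t_vec) / denom))
    return np.asarray(scores)

def normalize_across_layers(scores):
    """Min-max normalization over layers -- a stand-in for the paper's
    inter-layer normalization, whose exact form is not given in the source."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-8)
```

With per-layer hidden states extracted from a speech-adapted LLM for a paired utterance and transcript, the resulting score profile can then be inspected layer by layer, e.g. to see at which depth alignment peaks for a given task.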
📝 Abstract
Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input for LLMs, enabling better multimodal learning. A key consideration for these models is the cross-modal alignment between the text and audio modalities, which is a telltale sign of whether the LLM is able to associate semantic meaning with audio segments. While various methods exist for fusing these modalities, there is no standard metric for evaluating alignment quality in LLMs. In this work, we propose a new metric, ALAS (Automatic Latent Alignment Score). Our study examines the correlation between audio and text representations across Transformer layers for two different tasks (Spoken Question Answering and Emotion Recognition). We show that our metric behaves as expected across different layers and tasks.