🤖 AI Summary
Current SLU research lacks standardized, automated metrics for evaluating the quality of cross-modal latent-space alignment between speech and text, hindering interpretability and optimization of multimodal large language models (MLLMs). To address this, we propose ALAS (Automatic Latent Alignment Score), a fully automatic, annotation-free metric of implicit alignment. ALAS computes the cosine similarity between speech and text embeddings across Transformer layers, incorporating inter-layer normalization and task-adaptive aggregation to enable cross-layer and cross-task consistency analysis. Evaluated on spoken question answering and speech emotion recognition, ALAS accurately captures the depth-wise evolution of alignment, demonstrating strong discriminability, generalizability, and reproducibility. It provides a reliable, quantitative tool for analyzing alignment mechanisms and guiding architectural design in MLLMs.
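The layer-wise computation described above can be sketched as follows. The source does not specify the pooling strategy or the exact form of the inter-layer normalization, so the names `alas_layerwise` and `normalize_across_layers`, the mean pooling over the sequence, and the min-max normalization are all illustrative assumptions, not the paper's documented procedure.

```python
import numpy as np

def alas_layerwise(speech_embs, text_embs):
    """Cosine similarity between mean-pooled speech and text embeddings, per layer.

    speech_embs / text_embs: one (seq_len, dim) array per Transformer layer.
    Mean pooling over the sequence dimension is an illustrative choice.
    """
    scores = []
    for s, t in zip(speech_embs, text_embs):
        s_vec, t_vec = s.mean(axis=0), t.mean(axis=0)
        denom = np.linalg.norm(s_vec) * np.linalg.norm(t_vec) + 1e-8
        scores.append(float(np.dot(s_vec, t_vec) / denom))
    return np.asarray(scores)

def normalize_across_layers(scores):
    """Min-max normalization over layers -- a stand-in for the paper's
    inter-layer normalization, whose exact form is not given in the source."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-8)
```

With per-layer hidden states extracted from a speech-adapted LLM for a paired utterance and transcript, the resulting score profile can then be inspected layer by layer, e.g. to see at which depth alignment peaks for a given task.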
📝 Abstract
Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input for LLMs, enabling better multimodal learning. A key consideration for these models is the cross-modal alignment between the text and audio modalities, which is a telltale sign of whether the LLM is able to associate semantic meaning with audio segments. While various methods exist for fusing these modalities, there is no standard metric for evaluating alignment quality in LLMs. In this work, we propose a new metric, ALAS (Automatic Latent Alignment Score). Our study examines the correlation between audio and text representations across Transformer layers for two different tasks (Spoken Question Answering and Emotion Recognition). We show that our metric behaves as expected across different layers and tasks.