🤖 AI Summary
This work addresses the disconnect between the speech and language modalities by proposing a continuous multimodal fusion framework that unifies pretrained large language models (LLMs) with acoustic encoder-decoder architectures (e.g., Whisper). Methodologically, it introduces cross-modal attention to enable hidden-state-level fusion conditioned on audio, aligning speech and language representations within a continuous text latent space. The framework supports both offline and streaming speech understanding. As a key contribution, it yields VoxKrikri—the first Greek-language speech-enabled LLM. Empirically, VoxKrikri achieves state-of-the-art performance on Greek automatic speech recognition (ASR), reducing word error rate by approximately 20% relative, averaged across benchmarks, over prior systems. These results demonstrate the framework’s effectiveness and its promise for multilingual and low-resource settings.
📝 Abstract
We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLMs) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce *VoxKrikri*, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous-space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for automatic speech recognition in Greek, with an average ~20% relative improvement across benchmarks.
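The fusion mechanism the abstract describes — LLM hidden states attending over Whisper's decoder hidden states — can be sketched in miniature. The following is a hedged, dependency-free illustration of single-head scaled dot-product cross-attention with a residual connection, not the paper's actual implementation: real systems operate on batched tensors with learned multi-head projection matrices, all of which are omitted here for clarity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_fuse(llm_states, whisper_states):
    """Single-head cross-attention sketch: each LLM hidden state (query)
    attends over Whisper decoder hidden states (keys == values), and the
    attended audio-conditioned vector is added back residually.

    llm_states, whisper_states: lists of equal-dimension float vectors.
    (Illustrative only; learned Q/K/V projections are omitted.)
    """
    d = len(whisper_states[0])
    fused = []
    for q in llm_states:
        # Scaled dot-product attention scores against every Whisper state.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in whisper_states]
        weights = softmax(scores)
        # Weighted sum of Whisper states = audio-conditioned context vector.
        attended = [sum(w * v[i] for w, v in zip(weights, whisper_states))
                    for i in range(d)]
        # Residual connection keeps the original text representation.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused
```

In this toy form, a query that points toward one Whisper state receives most of that state's contribution; in the actual architecture the same attention pattern operates on continuous text-space representations from both decoders.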