VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion

📅 2025-09-19

🤖 AI Summary
This work addresses the disconnect between speech and language modalities by proposing a continuous multimodal fusion framework that couples pretrained decoder-based large language models (LLMs) with acoustic encoder-decoder architectures such as Whisper. Rather than feeding audio embeddings to the LLM directly, it uses cross-modal attention to fuse Whisper's decoder hidden states with the LLM's hidden states, so that alignment happens in an intermediate, audio-conditioned continuous text space. The framework supports both offline and streaming modes. Its main artifact is VoxKrikri, the first Greek speech LLM. VoxKrikri achieves state-of-the-art results on Greek automatic speech recognition (ASR), with an average relative improvement of about 20% across benchmarks. The authors present continuous-space fusion as a promising path for multilingual and low-resource speech LLMs.
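
To make the "about 20% relative improvement" figure concrete, the sketch below shows how a relative reduction is usually computed and averaged across benchmarks. It assumes the metric is word error rate (WER). The function names and the numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def mean_relative_reduction(pairs):
    """Average the per-benchmark relative reductions.

    `pairs` is a list of (baseline_wer, new_wer) tuples, one per benchmark.
    """
    reductions = [relative_wer_reduction(b, n) for b, n in pairs]
    return sum(reductions) / len(reductions)


# Hypothetical WER values (in percent), for illustration only.
example_pairs = [(10.0, 8.0), (20.0, 16.0)]
average = mean_relative_reduction(example_pairs)
print(f"average relative reduction: {average:.0%}")
```

A relative figure depends on the baseline: dropping from 10% to 8% WER and from 20% to 16% WER are both 20% relative reductions, although the absolute gains differ.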

📝 Abstract
We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLMs) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce VoxKrikri, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for Automatic Speech Recognition in Greek, providing an average ~20% relative improvement across benchmarks.
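
The fusion step described in the abstract can be pictured with a small single-head cross-attention sketch in NumPy. This is an illustration under stated assumptions, not the authors' implementation: it assumes the LLM hidden states act as queries and the Whisper decoder hidden states as keys and values, and that the result is added back through a residual connection. All names, dimensions, and random weights are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(llm_states, whisper_states, w_q, w_k, w_v, w_o):
    """Single-head cross-attention from LLM states to Whisper decoder states.

    Assumption (not confirmed by the source): queries come from the LLM,
    keys and values come from Whisper's decoder hidden states.
    """
    q = llm_states @ w_q          # (T_llm, d_attn)
    k = whisper_states @ w_k      # (T_whisper, d_attn)
    v = whisper_states @ w_v      # (T_whisper, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    context = weights @ v                     # (T_llm, d_attn)
    # Assumed residual connection: keep the LLM state, add audio-conditioned context.
    fused = llm_states + context @ w_o        # (T_llm, d_llm)
    return fused, weights


# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
d_llm, d_whisper, d_attn = 16, 12, 8
llm_states = rng.standard_normal((5, d_llm))          # 5 LLM positions
whisper_states = rng.standard_normal((7, d_whisper))  # 7 Whisper decoder positions
w_q = rng.standard_normal((d_llm, d_attn)) * 0.1
w_k = rng.standard_normal((d_whisper, d_attn)) * 0.1
w_v = rng.standard_normal((d_whisper, d_attn)) * 0.1
w_o = rng.standard_normal((d_attn, d_llm)) * 0.1

fused, weights = cross_modal_attention(llm_states, whisper_states, w_q, w_k, w_v, w_o)
print(fused.shape, weights.shape)
```

Because both inputs are decoder hidden states rather than raw audio embeddings, the attention operates between two text-like continuous spaces, which is the property the abstract emphasizes.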
Problem

Research questions and friction points this paper is trying to address.

Unifying speech and language through multimodal fusion
Building speech-enabled large language models
Improving multilingual and low-resource speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous fusion of audio and text spaces
Cross-modal attention for offline and streaming
Intermediate audio-conditioned text alignment mechanism