VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the modality gap between speech and language by proposing a continuous multimodal fusion framework that unifies pretrained large language models (LLMs) with acoustic encoder-decoder architectures such as Whisper. Methodologically, it introduces cross-modal attention for hidden-state-level fusion conditioned on audio, aligning speech and language representations within a continuous text latent space rather than feeding audio embeddings to the LLM directly. The framework supports both offline and streaming speech understanding. As a key contribution, it yields VoxKrikri, the first Greek speech-enabled LLM. Empirically, VoxKrikri achieves state-of-the-art performance on Greek automatic speech recognition (ASR), improving word error rate by roughly 20% relative, on average, across benchmarks. These results point to continuous-space fusion as a promising direction for multilingual and low-resource speech LLMs.

📝 Abstract
We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLMs) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce VoxKrikri, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for Automatic Speech Recognition in Greek, providing an average ~20% relative improvement across benchmarks.
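To make the fusion mechanism concrete, here is a minimal sketch (an illustration under stated assumptions, not the authors' released code) of hidden-state-level fusion through cross-modal attention: Whisper decoder states are projected into the LLM's hidden size so both modalities share the continuous text space, the LLM's text states attend to them, and a learnable gate folds the result back in as a residual. All module names, dimensions, and the gating choice are assumptions.

```python
# Minimal sketch of continuous-space cross-modal fusion (illustrative only).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, llm_dim: int, whisper_dim: int, n_heads: int = 8):
        super().__init__()
        # Project Whisper decoder states into the LLM hidden size so both
        # modalities live in the same continuous text representation space.
        self.audio_proj = nn.Linear(whisper_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)
        # Gate initialized at zero: fusion starts as an identity mapping,
        # leaving the pretrained LLM's behaviour untouched at first.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, llm_states: torch.Tensor, whisper_states: torch.Tensor) -> torch.Tensor:
        # llm_states:     (batch, text_len, llm_dim)       LLM hidden states
        # whisper_states: (batch, audio_len, whisper_dim)  Whisper decoder states
        audio_text = self.audio_proj(whisper_states)
        fused, _ = self.cross_attn(query=self.norm(llm_states),
                                   key=audio_text, value=audio_text)
        # Gated residual: the LLM stream is enriched with audio-conditioned
        # text information without being overwritten.
        return llm_states + torch.tanh(self.gate) * fused
```

A zero-initialized gate is a common stabilization trick when grafting a new attention path onto a pretrained model; whether the paper uses one is not stated in the abstract.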
Problem

Research questions and friction points this paper is trying to address.

Unifying speech and language through multimodal fusion
Building speech-enabled large language models
Improving multilingual and low-resource speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous fusion of audio and text spaces
Cross-modal attention for offline and streaming (see the sketch after this list)
Intermediate audio-conditioned text alignment mechanism
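As a hypothetical illustration of the streaming mode, assuming the CrossModalFusion module sketched after the abstract: audio arrives in chunks, Whisper decodes each chunk, and the LLM cross-attends only to the decoder states produced so far. Offline mode is then the degenerate case of a single chunk covering the whole utterance. The chunking loop and caching below are assumptions, not the paper's exact recipe.

```python
# Hypothetical streaming fusion loop (illustrative assumptions throughout).
import torch

def streaming_fuse(fusion, llm_states, whisper_chunks):
    """Fuse incrementally: each step sees only the audio-text states so far."""
    cache = []                              # accumulated Whisper decoder states
    fused = llm_states
    for chunk_states in whisper_chunks:     # each: (batch, chunk_len, whisper_dim)
        cache.append(chunk_states)
        audio_so_far = torch.cat(cache, dim=1)
        # Re-fuse against everything received so far; a real system would
        # also grow llm_states as new text is generated (omitted here).
        fused = fusion(llm_states, audio_so_far)
    return fused                            # conditioned on all audio seen

# Offline mode: a single chunk spanning the full utterance.
# streaming_fuse(fusion, llm_states, [all_whisper_states])
```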
Dimitrios Damianos
Institute for Speech and Language Processing, Athena Research Center, Greece
Leon Voukoutis
Institute for Speech and Language Processing, Athena Research Center, Greece
Georgios Paraskevopoulos
Associate Researcher, Institute for Speech and Language Processing, Athena RC
Multimodal Processing · Deep Learning · NLP · Domain adaptation
Vassilis Katsouros
Institute for Language and Speech Processing, Athena Research Center