🤖 AI Summary
This work addresses the disconnect between the speech and language modalities by proposing a continuous multimodal fusion framework that unifies pretrained large language models (LLMs) with acoustic encoder-decoder architectures (e.g., Whisper). Methodologically, it introduces cross-modal attention to enable hidden-state-level fusion conditioned on audio, aligning speech and language representations within a continuous text latent space. The framework supports both offline and streaming speech understanding. As a key contribution, it yields VoxKrikri—the first Greek-language speech-enabled LLM. Empirically, VoxKrikri achieves state-of-the-art performance on Greek automatic speech recognition (ASR), reducing word error rate by approximately 20% relative, averaged across benchmarks, over prior systems. These results demonstrate the framework’s effectiveness and its promise for multilingual and low-resource settings.
📝 Abstract
We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLMs) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce *VoxKrikri*, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous-space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for automatic speech recognition in Greek, with an average ~20% relative improvement across benchmarks.
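The fusion mechanism the abstract describes — LLM hidden states attending over Whisper's decoder hidden states — can be sketched in miniature. The following is a hedged, dependency-free illustration of single-head scaled dot-product cross-attention with a residual connection, not the paper's actual implementation: real systems operate on batched tensors with learned multi-head projection matrices, all of which are omitted here for clarity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_fuse(llm_states, whisper_states):
    """Single-head cross-attention sketch: each LLM hidden state (query)
    attends over Whisper decoder hidden states (keys == values), and the
    attended audio-conditioned vector is added back residually.

    llm_states, whisper_states: lists of equal-dimension float vectors.
    (Illustrative only; learned Q/K/V projections are omitted.)
    """
    d = len(whisper_states[0])
    fused = []
    for q in llm_states:
        # Scaled dot-product attention scores against every Whisper state.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in whisper_states]
        weights = softmax(scores)
        # Weighted sum of Whisper states = audio-conditioned context vector.
        attended = [sum(w * v[i] for w, v in zip(weights, whisper_states))
                    for i in range(d)]
        # Residual connection keeps the original text representation.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused
```

In this toy form, a query that points toward one Whisper state receives most of that state's contribution; in the actual architecture the same attention pattern operates on continuous text-space representations from both decoders.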