Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of uncontrolled script output in multilingual speech foundation models—such as Whisper—when transcribing regional variants of the same language that employ different writing scripts. The study presents the first evidence that script information is disentangled within the linear activation space of these models. Building on this insight, the authors propose a zero-shot inference-time intervention that modulates intermediate activations using script-specific vectors, enabling controllable cross-script transcription without any retraining. This approach supports arbitrary language–script pairings (e.g., transcribing Italian speech into Cyrillic script) and achieves competitive performance across all Whisper model scales, substantially enhancing the controllability of speech recognition systems over output script choice.

📝 Abstract
Multilingual speech foundation models such as Whisper are trained on web-scale data, where data for each language consists of a myriad of regional varieties. However, different regional varieties often employ different scripts to write the same language, rendering speech recognition output also subject to non-determinism in the output script. To mitigate this problem, we show that script is linearly encoded in the activation space of multilingual speech models, and that modifying activations at inference time enables direct control over output script. We find the addition of such script vectors to activations at test time can induce script change even in unconventional language-script pairings (e.g. Italian in Cyrillic and Japanese in Latin script). We apply this approach to inducing post-hoc control over the script of speech recognition output, where we observe competitive performance across all model sizes of Whisper.
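The abstract does not specify how the script vectors are derived, but a common recipe for this style of inference-time intervention is to take the difference of mean activations between examples written in two scripts, then add a scaled copy of that vector to hidden states at test time. A minimal sketch of that idea, assuming mean-difference steering (the function names `script_vector` and `steer` are hypothetical, not from the paper):

```python
import numpy as np

def script_vector(acts_script_a: np.ndarray, acts_script_b: np.ndarray) -> np.ndarray:
    """Steering vector pointing from script A toward script B.

    Each input is an (n_examples, hidden_dim) array of intermediate
    activations collected from transcriptions in one script. The vector
    is the difference of the per-script mean activations.
    """
    return acts_script_b.mean(axis=0) - acts_script_a.mean(axis=0)

def steer(hidden: np.ndarray, vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled script vector to a hidden state at inference time.

    No retraining is involved: the intervention only shifts activations
    in the direction associated with the target script.
    """
    return hidden + alpha * vec

# Toy illustration: activations for "script A" cluster at 0, "script B" at 1.
acts_a = np.zeros((8, 4))
acts_b = np.ones((8, 4))
vec = script_vector(acts_a, acts_b)
steered = steer(np.zeros(4), vec, alpha=1.0)
```

In a real model this addition would be applied inside the decoder (e.g. via a forward hook on an intermediate layer), with `alpha` tuned so the output switches script without degrading transcription quality.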
Problem

Research questions and friction points this paper is trying to address.

speech recognition
script variation
multilingual models
output non-determinism
transliteration
Innovation

Methods, ideas, or system contributions that make the work stand out.

linear script representation
zero-shot transliteration
speech foundation models
activation space manipulation
script control