In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads

📅 2026-04-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates the mechanisms of in-context learning (ICL) in Speech Language Models, focusing on how linguistic and acoustic features affect both task inference and imitation of the demonstration speech. Using the text-to-speech (TTS) task, the authors analyze ICL along two dimensions: content accuracy (whether the model generates the correct spoken content) and acoustic mimicry (how closely the output reproduces the acoustic characteristics of the demonstrations). Through controlled manipulation of acoustic features and induction head ablation, they show that speaking rate strongly affects ICL performance and is mimicked in the output, whereas pitch range and intensity have little impact on performance and are not consistently reproduced. Ablating the top-k induction heads completely removes the model's ICL ability, indicating that these heads play a causal role that mirrors findings from text-based ICL, which has so far been studied far more extensively than ICL in the speech domain.

📝 Abstract

In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain. Here, we investigate how linguistic and acoustic features affect ICL in Speech Language Models. We focus on the Text-to-Speech (TTS) task, which allows us to analyze ICL from two angles: (1) how accurately the model infers the task from the demonstrations (i.e., generating the correct spoken content), and (2) to what extent the model mimics the acoustic characteristics of the demonstration speech in its output. We find that speaking rate strongly affects ICL performance and is also mimicked in the output, whereas pitch range and intensity have little impact on performance and are not consistently reproduced. Finally, we investigate the role of induction heads in speech-based ICL and show that these heads play a causal role: ablating the top-k induction heads completely removes the model's ICL ability, mirroring findings from text-based ICL.
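
The top-k ablation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how induction heads are scored or how their outputs are removed, so everything here is an assumption made for illustration: the per-head induction scores are invented numbers, heads are identified by hypothetical `(layer, head)` pairs, and ablation is modeled as zeroing a head's output vector (mean ablation would be another common choice). The helper names `top_k_heads` and `ablate_heads` are hypothetical and do not come from the paper.

```python
def top_k_heads(scores, k):
    """Return the (layer, head) pairs with the k highest induction scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [head for head, _ in ranked[:k]]


def ablate_heads(head_outputs, heads_to_ablate):
    """Zero the output vector of every ablated head; copy the others unchanged.

    Zero ablation is an assumption; the source does not state the method.
    """
    ablated = set(heads_to_ablate)
    return {
        head: [0.0] * len(vec) if head in ablated else list(vec)
        for head, vec in head_outputs.items()
    }


# Invented induction scores, keyed by a hypothetical (layer, head) index.
scores = {(0, 0): 0.05, (1, 2): 0.91, (2, 1): 0.78, (3, 0): 0.12}

# Toy per-head output vectors standing in for real attention-head outputs.
outputs = {head: [1.0, 2.0] for head in scores}

top = top_k_heads(scores, k=2)
ablated_outputs = ablate_heads(outputs, top)
```

In a real model the zeroing would happen inside the forward pass (for example via a hook on each attention layer), and ICL performance would then be compared with and without the ablation.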

Problem

Research questions and friction points this paper is trying to address.

In-Context Learning

Speech Language Models

Acoustic Features

Linguistic Structure

Induction Heads

Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Learning

Speech Language Models

Acoustic Features