TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the scarcity of high-quality audio-text paired data and the limited precision of existing alignment methods in post-training of speech large language models. The authors propose a controllable CTC-based simulation framework that, for the first time, enables explicit control over word error rate (WER) and uncertainty, generating difficulty-adjustable textual supervision signals without requiring text-to-speech (TTS) synthesis. This approach facilitates principled curriculum learning strategies and achieves significant improvements over strong baselines, including TASU, text-only fine-tuning, and TTS-augmented methods, across multiple domain transfer tasks, while effectively mitigating performance degradation on the source domain.
📝 Abstract
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose **TASU2**, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
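To make the core idea concrete, here is a minimal illustrative sketch (not the authors' implementation) of what "simulating CTC posteriors from a transcript under a target WER" could look like: the transcript is corrupted with substitutions, deletions, and insertions at roughly the requested rate, then each surviving token is expanded into CTC-style frame posteriors (token/blank alternation) whose peakiness is controlled by a `peak_prob` parameter standing in for uncertainty. All names and the error model here are assumptions for illustration only.

```python
import random

def simulate_ctc_posteriors(transcript, vocab, target_wer=0.2,
                            peak_prob=0.9, blank="<blank>", seed=0):
    """Hypothetical sketch: corrupt `transcript` at roughly `target_wer`,
    then emit CTC-style per-frame posteriors with `peak_prob` mass on the
    chosen symbol and the remainder spread uniformly over the vocabulary."""
    rng = random.Random(seed)
    corrupted = []
    for word in transcript.split():
        r = rng.random()
        if r < target_wer / 3:           # substitution
            corrupted.append(rng.choice(vocab))
        elif r < 2 * target_wer / 3:     # deletion
            continue
        else:
            corrupted.append(word)
            if r < target_wer:           # insertion after a kept word
                corrupted.append(rng.choice(vocab))

    full_vocab = vocab + [blank]
    rest = (1.0 - peak_prob) / (len(full_vocab) - 1)
    frames = []
    for tok in corrupted:
        for symbol in (tok, blank):      # CTC-style token/blank alternation
            frames.append({v: (peak_prob if v == symbol else rest)
                           for v in full_vocab})
    return corrupted, frames
```

Lowering `peak_prob` flattens each frame's distribution (higher uncertainty), while `target_wer` shifts how far the supervision deviates from the clean transcript, which is the kind of knob a difficulty curriculum would sweep.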
Problem

Research questions and friction points this paper is trying to address.

speech LLM alignment
low-resource adaptation
CTC simulation
WER control
text-only supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

controllable CTC simulation
speech LLM alignment
low-resource adaptation
text-only supervision
curriculum learning
Jing Peng
Shanghai Jiao Tong University
Automatic Speech Recognition, Speech Large Language Model
Chenghao Wang
Northeastern University
Robotics
Yi Yang
Associate Professor, HKUST Business School, Hong Kong University of Science and Technology
Machine Learning, NLP, Large Language Models
Lirong Qian
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, China
Junjie Li
Shanghai Jiao Tong University
Computer Vision
Yu Xi
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, China
Shuai Wang
Nanjing University
AI
Kai Yu
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, China