CogSR: Semantic-Aware Speech Super-Resolution via Chain-of-Thought Guided Flow Matching

📅 2025-12-18
🤖 AI Summary
Existing generative models for speech super-resolution of historical and surveillance audio recorded at extremely low sampling rates (e.g., <4 kHz) suffer from severe semantic hallucination because the input carries too few acoustic cues. To address this, we propose a semantics-guided cognitive restoration framework that (1) introduces Chain-of-Thought (CoT) reasoning as an explicit semantic anchor for modeling linguistic structure, and (2) jointly optimizes speech content fidelity and speaker identity consistency through rectified-flow generation coupled with an acoustic identity constraint module. The method significantly mitigates ambiguity under severe degradation, achieving a 23.6% improvement in word-level accuracy and a 4.8 dB gain in PSNR for high-frequency details. This work presents the first end-to-end solution for high-fidelity audio restoration that preserves both linguistic semantics and acoustic authenticity, enabling reliable applications in forensic audio analysis and digital humanities.

📝 Abstract
Applying speech super-resolution (SR) to recordings with severely low sampling rates is a critical challenge in digital archiving and investigative audio recovery. In these scenarios, the input lacks essential acoustic cues. Consequently, existing generative models often fail; without sufficient context, they hallucinate phonetic content, guessing words based on probability rather than meaning. To address this, we propose CogSR, a framework designed specifically for high-precision, offline restoration. Our approach shifts the focus from simple signal mapping to cognitive reconstruction. By integrating a Large Audio-Language Model, we employ Chain-of-Thought reasoning to act as a semantic anchor, while explicit acoustic priors ensure the speaker's identity remains consistent. This guides a Rectified Flow backbone to synthesize high-frequency details that are not only realistic but linguistically accurate. Evaluations show that CogSR effectively eliminates ambiguity in severe degradation regimes, making it a robust solution for restoring high-value legacy and surveillance audio.
Problem

Research questions and friction points this paper is trying to address.

Restores low-sampling-rate speech with semantic accuracy
Prevents hallucination of phonetic content in degraded audio
Ensures speaker identity consistency in super-resolution reconstruction
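
To make the "missing acoustic cues" point concrete, here is a small illustrative sketch of the Nyquist limit. It is not taken from the paper: the 4 kHz rate echoes the summary's "<4 kHz" example, while the 3 kHz tone and the helper names (nyquist_limit_hz, alias_frequency_hz, sample_tone) are assumptions chosen for illustration. A recording sampled at 4 kHz can only represent content below 2 kHz, so anything above that is either filtered out before sampling or folded onto a lower frequency, and in both cases the original high-band information is gone.

```python
import math
from typing import List


def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sampling rate can represent unambiguously."""
    return sample_rate_hz / 2.0


def alias_frequency_hz(tone_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone sampled WITHOUT an anti-aliasing filter."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sample_tone(tone_hz: float, sample_rate_hz: float, num_samples: int) -> List[float]:
    """Sample a unit-amplitude cosine at the given rate."""
    return [
        math.cos(2.0 * math.pi * tone_hz * k / sample_rate_hz)
        for k in range(num_samples)
    ]


# Illustrative values only: a 3 kHz component in a 4 kHz recording.
SAMPLE_RATE_HZ = 4000.0
high_tone = sample_tone(3000.0, SAMPLE_RATE_HZ, 16)
low_tone = sample_tone(alias_frequency_hz(3000.0, SAMPLE_RATE_HZ), SAMPLE_RATE_HZ, 16)

# After sampling, the two tones produce the same sample values, so the
# 3 kHz component cannot be told apart from a 1 kHz one.
indistinguishable = all(abs(a - b) < 1e-9 for a, b in zip(high_tone, low_tone))
```

Because the high band cannot be read back from the samples, a super-resolution model has to infer it from other evidence, which is where the semantic and speaker priors described in the abstract come in.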
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought reasoning guides semantic reconstruction
Rectified Flow backbone synthesizes realistic high-frequency details
Large Audio-Language Model integrates acoustic and linguistic priors
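
The Rectified Flow idea named above can be sketched in a few lines. This is a minimal, generic illustration under stated assumptions, not the authors' implementation: toy_velocity is a hypothetical stand-in for a trained network, the cond argument stands in for whatever semantic and speaker conditioning the model receives, and all function names are invented here. Rectified flow trains a velocity field on straight-line paths x_t = (1 - t) * x0 + t * x1 with regression target x1 - x0, then generates by integrating that field from t = 0 to t = 1.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VelocityFn = Callable[[Sequence[float], float, Sequence[float]], Vector]


def interpolate(x0: Sequence[float], x1: Sequence[float], t: float) -> Vector:
    """Point on the straight path between a source sample x0 and a target x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Sequence[float], x1: Sequence[float]) -> Vector:
    """Regression target for the velocity network along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity: VelocityFn, x0: Sequence[float],
                 cond: Sequence[float], num_steps: int = 8) -> Vector:
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # evaluated at the start of each step, so t never reaches 1
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Sequence[float], t: float, cond: Sequence[float]) -> Vector:
    """ASSUMPTION: hypothetical oracle that points straight at `cond`.

    A real system would use a learned network here, and `cond` would be
    conditioning features rather than the answer itself.
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


# Toy run: start from an arbitrary source vector and flow to the target.
source = [0.0, 1.0, -2.0]
target = [0.5, -0.5, 3.0]
generated = euler_sample(toy_velocity, source, target, num_steps=8)
```

With the oracle field the Euler integration lands on the target, which shows the mechanics of sampling; the quality of a real model depends entirely on how well the learned velocity network uses its conditioning.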
Jiajun Yuan
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China
Xiaochen Wang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China
Yuhang Xiao
Shenzhen University
Yulin Wu
School of Artificial Intelligence, Jianghan University, Wuhan, China
Chenhao Hu
Department of Psychology, Tsinghua University
environmental psychology, intervention studies, health psychology, social psychology
Xueyang Lv
Xiaomi Corporation, Beijing, China