🤖 AI Summary
Existing generative models for super-resolution of historical and surveillance audio at extremely low sampling rates (e.g., <4 kHz) suffer from severe semantic hallucination because the input carries critically insufficient acoustic cues. To address this, we propose a semantics-guided cognitive restoration framework that (1) introduces Chain-of-Thought (CoT) reasoning as an explicit semantic anchor to model linguistic structure, and (2) jointly optimizes speech content fidelity and speaker identity consistency via rectified flow-based generation coupled with an acoustic identity constraint module. Our method significantly mitigates ambiguity under severe degradation, achieving a 23.6% improvement in word-level accuracy and a 4.8 dB gain in PSNR for high-frequency details. This work presents the first end-to-end solution for high-fidelity audio restoration that simultaneously preserves linguistic semantics and acoustic authenticity, enabling reliable applications in forensic audio analysis and digital humanities.
📝 Abstract
Applying speech super-resolution (SR) to recordings with severely low sampling rates is a critical challenge in digital archiving and investigative audio recovery. In these scenarios, the input lacks essential acoustic cues. Consequently, existing generative models often fail; without sufficient context, they hallucinate phonetic content, guessing words based on probability rather than meaning.
To address this, we propose CogSR, a framework designed specifically for high-precision, offline restoration. Our approach shifts the focus from simple signal mapping to cognitive reconstruction. We integrate a Large Audio-Language Model whose Chain-of-Thought reasoning acts as a semantic anchor, while explicit acoustic priors keep the speaker's identity consistent. Together, these two sources of guidance steer a Rectified Flow backbone to synthesize high-frequency details that are not only realistic but also linguistically accurate. Evaluations show that CogSR effectively eliminates ambiguity in severe degradation regimes, making it a robust solution for restoring high-value legacy and surveillance audio.