Thinking While Listening: Simple Test Time Scaling For Audio Classification

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of explicit reasoning capabilities and test-time scaling support in audio classification. To this end, we propose the “Listen-and-Reason” framework: it freezes a small language model (e.g., GPT-2), fine-tunes only its embedding matrix, and introduces a test-time reasoning mechanism driven by sampled decoding trajectories. We further integrate open-source reasoning models, including GPT-OSS-20B and Qwen3-14B, for comparative validation. To our knowledge, this is the first work to systematically incorporate large-model reasoning into audio classification, combining a lightweight design (well under 1B parameters) with strong generalization. Experiments demonstrate significant improvements in classification accuracy across multiple settings; test-time scaling yields consistent gains as more traces are sampled; and the lightweight model surpasses much larger text-based reasoning models that rely on zero-shot inference. The framework bridges audio perception and structured reasoning without full-model fine-tuning, enabling efficient, scalable, and interpretable audio understanding.
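The embedding-only fine-tuning idea can be sketched in PyTorch. The class, dimensions, and toy transformer backbone below are illustrative assumptions standing in for GPT-2, not the paper's actual code; the point is only the freezing pattern, where every parameter except the token embedding matrix has gradients disabled.

```python
import torch
import torch.nn as nn

class FrozenLMClassifier(nn.Module):
    """Toy stand-in for the paper's setup: a frozen transformer backbone
    (playing the role of GPT-2) whose token embedding matrix is the only
    trainable component. All shapes are illustrative assumptions."""

    def __init__(self, vocab_size=100, d_model=32, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # the only trainable part
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # frozen
        self.head = nn.Linear(d_model, n_classes)  # frozen here for illustration
        # Freeze everything, then re-enable gradients for the embeddings only.
        for p in self.parameters():
            p.requires_grad = False
        for p in self.embed.parameters():
            p.requires_grad = True

    def forward(self, tokens):
        h = self.backbone(self.embed(tokens))   # (batch, seq, d_model)
        return self.head(h.mean(dim=1))         # mean-pool, then classify

model = FrozenLMClassifier()
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the embedding weight survives the freeze
```

An optimizer would then be built over `model.embed.parameters()` alone, which is what keeps the trainable footprint far below the backbone's size.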

📝 Abstract
We propose a framework that enables neural models to "think while listening" to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that in both settings, our models exhibit improved classification accuracy. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, showing that while such models are capable of zero-shot reasoning, a lightweight approach--retraining only the embedding matrix of a frozen, smaller model like GPT-2--can surpass the performance of billion-parameter text-based reasoning models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing audio classification by enabling models to reason while listening
Incorporating thinking capabilities into existing audio classification pipelines
Designing new architectures supporting thinking with test-time scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time scaling framework for audio classification
Retraining embedding matrix of frozen smaller models
Reasoning in category space during audio processing
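The test-time scaling mechanism above, sampling multiple reasoning traces and aggregating their predicted labels, can be illustrated with a self-contained majority-vote sketch. The stochastic "trace" below is a stand-in assumption (a classifier that is right with fixed probability), not the paper's model; it only shows why accuracy grows with the number of sampled traces.

```python
import random
from collections import Counter

def sample_trace(true_label, n_classes=10, p_correct=0.6, rng=random):
    """Stand-in for one sampled reasoning trace: returns the true class
    with probability p_correct, otherwise a uniformly random wrong class."""
    if rng.random() < p_correct:
        return true_label
    wrong = [c for c in range(n_classes) if c != true_label]
    return rng.choice(wrong)

def predict_with_scaling(true_label, n_traces, rng=random):
    """Sample n_traces traces and return the plurality-vote label."""
    votes = Counter(sample_trace(true_label, rng=rng) for _ in range(n_traces))
    return votes.most_common(1)[0][0]

def accuracy(n_traces, n_trials=2000, seed=0):
    """Empirical accuracy of voting over n_traces sampled traces."""
    rng = random.Random(seed)
    return sum(predict_with_scaling(3, n_traces, rng) == 3
               for _ in range(n_trials)) / n_trials

# More sampled traces -> a more reliable vote, mirroring the paper's
# observation that gains are consistent as the number of traces grows.
print(accuracy(1), accuracy(5), accuracy(15))
```

Because each wrong class splits the error mass, even a modestly-better-than-chance sampler converges quickly under plurality voting.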
Prateek Verma
Department of Electrical Engineering, Stanford University
Mert Pilanci
Stanford University
Machine Learning · Optimization · Neural Networks · Signal Processing · Information Theory