🤖 AI Summary
The syntactic and conceptual representational capabilities of current speech-language models (SLMs) remain poorly understood.
Method: We systematically evaluate S3Ms, ASR models, speech codecs, and AudioLLM encoders on contextual syntactic and semantic feature encoding, introducing, for the first time, a fine-grained probing methodology based on minimal contrasting pairs, combined with diagnostic classifiers and layer-wise, time-resolved analysis across 71 linguistic tasks.
Contribution/Results: Our analysis reveals that all SLMs encode syntactic features more robustly than conceptual ones and exhibit a clear, context-sensitive hierarchical evolution across layers, confirming a shallow-to-deep “syntax → semantics” organizational principle in speech representations. The study establishes the first interpretable, hierarchical evaluation framework for speech–language joint modeling, enabling granular, cross-layer linguistic feature decomposition and advancing our understanding of how speech encoders structure linguistic knowledge.
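To make the probing setup concrete, the sketch below shows a layer-wise diagnostic classifier in the spirit of this methodology. It is a minimal illustration under stated assumptions, not the paper's actual pipeline: wav2vec 2.0 stands in for the evaluated encoders, features are mean-pooled over time, and a logistic-regression probe replaces whatever classifier the study uses.

```python
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed stand-in encoder; the paper's exact models are not specified here.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def layerwise_features(waveform, sr=16000):
    """Return one mean-pooled feature vector per encoder layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (num_layers + 1) tensors of shape [1, time, dim]
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

def probe_layer(waveforms, labels, layer):
    """Diagnostic classifier: cross-validated linear-probe accuracy at one layer."""
    X = np.stack([layerwise_features(w)[layer] for w in waveforms])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, np.asarray(labels), cv=5).mean()
```

Sweeping `probe_layer` over every layer index yields an accuracy-by-depth curve; under the reported syntax → semantics ordering, syntactic contrasts would peak at shallower layers than conceptual ones.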
📝 Abstract
Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. While existing research has examined how well SLMs encode shallow acoustic and phonetic features, the extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. Drawing parallels with linguistic competence assessments for large language models, this study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs trained for self-supervised learning (S3Ms), automatic speech recognition (ASR), and speech compression (codecs), as well as SLMs serving as encoders for auditory large language models (AudioLLMs). Through minimal pair designs and diagnostic feature analysis across 71 tasks spanning diverse linguistic levels, our layer-wise and time-resolved analysis uncovers that 1) all speech models encode grammatical features more robustly than conceptual ones, and 2) these features emerge in a hierarchical, shallow-to-deep progression across layers.
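For intuition, a minimal contrasting pair holds everything fixed except the feature under test, so above-chance probe accuracy can be attributed to that feature alone. The pairs below are hypothetical illustrations of the design, not items drawn from the paper's 71 tasks.

```python
# Illustrative minimal pairs (hypothetical items, not from the paper's tasks).
# Each pair differs in exactly one feature; the probe must separate the two.
minimal_pairs = [
    {   # syntactic contrast: subject-verb agreement
        "feature": "agreement",
        "acceptable": "The dogs near the gate bark loudly.",
        "unacceptable": "The dogs near the gate barks loudly.",
    },
    {   # conceptual/semantic contrast: selectional plausibility
        "feature": "plausibility",
        "acceptable": "She poured the water into the glass.",
        "unacceptable": "She poured the glass into the water.",
    },
]
```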