🤖 AI Summary
Existing speech style recognition methods over-rely on linguistic modalities while underutilizing acoustic information, leading to suboptimal performance. To address this, we propose a serial-parallel dual-path fusion architecture: the serial path models temporal dependencies via an ASR+STYLE paradigm, while the parallel path enables synchronous cross-modal interaction through an Acoustic–Linguistic Similarity Module (ALSM). We further design a lightweight bimodal fusion network and adopt an end-to-end training strategy. Evaluated on an eight-class style recognition task, our method achieves a 30.3% absolute accuracy improvement over the OSUM baseline while reducing model parameters by 88.4%. The core contribution is the first integration of explicit temporal modeling with similarity-driven cross-modal interaction for speech style recognition, yielding significant gains in both accuracy and computational efficiency.
📝 Abstract
Speaking Style Recognition (SSR) identifies a speaker's speaking style characteristics from speech. Existing approaches rely primarily on linguistic information and make limited use of acoustic information, which constrains further gains in recognition accuracy. Fusing the acoustic and linguistic modalities therefore offers significant potential to improve recognition performance. In this paper, we propose a novel serial-parallel dual-path architecture for SSR that leverages acoustic-linguistic bimodal information. The serial path follows the ASR+STYLE serial paradigm, reflecting a sequential temporal dependency, while the parallel path integrates our designed Acoustic-Linguistic Similarity Module (ALSM) to facilitate cross-modal interaction with temporal simultaneity. Compared to the existing SSR baseline, the OSUM model, our approach reduces parameter size by 88.4% and achieves a 30.3% absolute improvement in SSR accuracy across eight styles on the test set.
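The abstract describes the ALSM only at a high level (similarity-driven cross-modal interaction between acoustic and linguistic features). As an illustration only, here is a minimal NumPy sketch of one plausible realization: cosine-similarity cross-attention from linguistic tokens to acoustic frames, followed by concatenation into a bimodal representation. The function name `alsm_fuse`, the shapes, and the fusion-by-concatenation choice are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alsm_fuse(acoustic, linguistic):
    """Hypothetical ALSM-style fusion (illustrative sketch only).

    acoustic:   (T_a, d) frame-level acoustic embeddings
    linguistic: (T_l, d) token-level linguistic embeddings
    Returns a (T_l, 2*d) bimodal representation: each token embedding
    concatenated with a similarity-weighted summary of the acoustic frames.
    """
    # Cosine similarity between every linguistic token and every acoustic frame
    a = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    l = linguistic / np.linalg.norm(linguistic, axis=1, keepdims=True)
    sim = l @ a.T                     # (T_l, T_a) similarity matrix
    weights = softmax(sim, axis=-1)   # attend over acoustic frames per token
    attended = weights @ acoustic     # (T_l, d) acoustic summary per token
    return np.concatenate([linguistic, attended], axis=1)

# Toy example: 50 acoustic frames, 12 linguistic tokens, 64-dim embeddings
rng = np.random.default_rng(0)
fused = alsm_fuse(rng.normal(size=(50, 64)), rng.normal(size=(12, 64)))
print(fused.shape)  # (12, 128)
```

In a real system the fused representation would feed the lightweight bimodal fusion network mentioned in the summary; the sketch only shows how a similarity matrix can mediate synchronous interaction between the two modalities.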