🤖 AI Summary
Existing speech style recognition methods over-rely on linguistic modalities while underutilizing acoustic information, leading to suboptimal performance. To address this, we propose a serial-parallel dual-path fusion architecture: the serial path models temporal dependencies via an ASR+STYLE paradigm, while the parallel path enables synchronous cross-modal interaction through an Acoustic–Linguistic Similarity Module (ALSM). We further design a lightweight bimodal fusion network and adopt an end-to-end training strategy. Evaluated on an eight-class style recognition task, our method achieves a 30.3% absolute accuracy improvement over the OSUM baseline while reducing model parameters by 88.4%. The core contribution is the first integration of explicit temporal modeling with similarity-driven cross-modal interaction for speech style recognition, yielding significant gains in both accuracy and computational efficiency.
📝 Abstract
Speaking Style Recognition (SSR) identifies a speaker's speaking style characteristics from speech. Existing approaches rely primarily on linguistic information and make limited use of acoustic information, which constrains further gains in recognition accuracy. Fusing the acoustic and linguistic modalities therefore offers significant potential to improve recognition performance. In this paper, we propose a novel serial-parallel dual-path architecture for SSR that leverages acoustic-linguistic bimodal information. The serial path follows the ASR+STYLE serial paradigm, reflecting a sequential temporal dependency, while the parallel path integrates our designed Acoustic-Linguistic Similarity Module (ALSM) to facilitate cross-modal interaction with temporal simultaneity. Compared to the existing SSR baseline, the OSUM model, our approach reduces parameter size by 88.4% and achieves a 30.3% absolute improvement in SSR accuracy across eight styles on the test set.
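The abstract describes the ALSM only at a high level (similarity-driven cross-modal interaction between acoustic and linguistic features). As an illustration only, here is a minimal NumPy sketch of one plausible realization: cosine-similarity cross-attention from linguistic tokens to acoustic frames, followed by concatenation into a bimodal representation. The function name `alsm_fuse`, the shapes, and the fusion-by-concatenation choice are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alsm_fuse(acoustic, linguistic):
    """Hypothetical ALSM-style fusion (illustrative sketch only).

    acoustic:   (T_a, d) frame-level acoustic embeddings
    linguistic: (T_l, d) token-level linguistic embeddings
    Returns a (T_l, 2*d) bimodal representation: each token embedding
    concatenated with a similarity-weighted summary of the acoustic frames.
    """
    # Cosine similarity between every linguistic token and every acoustic frame
    a = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    l = linguistic / np.linalg.norm(linguistic, axis=1, keepdims=True)
    sim = l @ a.T                     # (T_l, T_a) similarity matrix
    weights = softmax(sim, axis=-1)   # attend over acoustic frames per token
    attended = weights @ acoustic     # (T_l, d) acoustic summary per token
    return np.concatenate([linguistic, attended], axis=1)

# Toy example: 50 acoustic frames, 12 linguistic tokens, 64-dim embeddings
rng = np.random.default_rng(0)
fused = alsm_fuse(rng.normal(size=(50, 64)), rng.normal(size=(12, 64)))
print(fused.shape)  # (12, 128)
```

In a real system the fused representation would feed the lightweight bimodal fusion network mentioned in the summary; the sketch only shows how a similarity matrix can mediate synchronous interaction between the two modalities.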