A Comparative Analysis on ASR System Combination for Attention, CTC, Factored Hybrid, and Transducer Models

📅 2025-08-13
🤖 AI Summary
Existing ASR model combination methods suffer from evaluation bias due to heterogeneous decoding paradigms (e.g., autoregressive vs. non-autoregressive, frame-level vs. sequence-level outputs) and incompatible label units, hindering fair cross-architecture comparison. Method: This work systematically investigates performance differences and complementarity among four dominant ASR architectures—Attention-based, CTC, Factored Hybrid, and Transducer models—and proposes a unified two-stage decoding framework: (1) independent N-best generation per model; (2) sequence-level log-linear score fusion followed by beam-rescoring for robustness. Contribution/Results: The framework eliminates architectural biases, enabling fair, unit-agnostic evaluation across paradigms. On LibriSpeech 960h, ensemble decoding achieves a 12.3% relative WER reduction over the best single-model baseline, demonstrating the effectiveness and generalizability of heterogeneous ASR architecture collaboration.

📝 Abstract
Combination approaches for automatic speech recognition (ASR) systems cover structured sentence-level or word-based merging techniques as well as combination of model scores during beam search. In this work, we compare model combination across popular ASR architectures. Our method leverages the complementary strengths of different models in exploring diverse portions of the search space. We rescore a joint hypothesis list drawn from the candidates of two models, then identify the best hypothesis through log-linear combination of their sequence-level scores. While model combination during first-pass recognition may yield improved performance, it introduces variability due to differing decoding methods, making direct comparison more challenging. Our two-pass method ensures consistent comparisons across all system combination results presented in this study. We evaluate model pair candidates with varying architectures, label topologies, and label units. Experimental results are provided for the LibriSpeech 960h task.
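The two-pass scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hypothesis strings, log scores, and interpolation weight are placeholder assumptions, and missing cross-scores are floored rather than rescored by the other model as the paper's rescoring pass would do.

```python
def combine_nbest(nbest_a, nbest_b, weight=0.5):
    """Merge two models' N-best lists and pick the best hypothesis
    by log-linear combination of sequence-level log scores.

    nbest_a / nbest_b: dict mapping hypothesis string -> log score
    under model A / model B. weight: interpolation weight for model A.
    """
    # Joint hypothesis list: union of both models' candidates
    joint = set(nbest_a) | set(nbest_b)
    # Simplification: floor the score when a model has not scored a
    # hypothesis (in the paper, each model rescores the full joint list)
    floor = -1e9
    scored = {
        hyp: weight * nbest_a.get(hyp, floor)
             + (1.0 - weight) * nbest_b.get(hyp, floor)
        for hyp in joint
    }
    # Best hypothesis under the combined sequence-level score
    return max(scored, key=scored.get)
```

In practice the weight would be tuned on a development set, and each model would assign a proper score to every hypothesis in the joint list before combination.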
Problem

Research questions and friction points this paper is trying to address.

Compare model combination across popular ASR architectures
Leverage complementary strengths of different models in search space
Ensure consistent comparisons in system combination results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines diverse ASR models' strengths
Rescores joint hypothesis list
Uses log-linear combination for best hypothesis