On-device Streaming Discrete Speech Units

📅 2025-06-02

🤖 AI Summary

Conventional discrete speech unit (DSU) extraction requires full-utterance input and computationally expensive self-supervised speech models (S3Ms), which makes it impractical on resource-constrained edge devices. This paper proposes a lightweight framework for streaming, on-device DSU extraction that aims to preserve the quality of the speech representation. The approach shrinks both the attention window and the model size, combining a lightweight S3M, sliding-window feature extraction, streaming clustering, and efficient DSU encoding. On the ML-SUPERB 1h benchmark, the method reduces floating-point operations (FLOPs) by 50% relative to the baseline at the cost of a 6.5% relative increase in character error rate (CER). These results suggest that DSUs are a practical option for real-time speech processing on edge devices.

📝 Abstract

Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms). DSUs offer significant advantages for on-device streaming speech applications due to their rich phonetic information, high transmission efficiency, and seamless integration with large language models. However, conventional DSU-based approaches are impractical as they require full-length speech input and computationally expensive S3Ms. In this work, we reduce both the attention window and the model size while preserving the effectiveness of DSUs. Our results demonstrate that we can reduce floating-point operations (FLOPs) by 50% with only a relative increase of 6.5% in character error rate (CER) on the ML-SUPERB 1h dataset. These findings highlight the potential of DSUs for real-time speech processing in resource-constrained environments.
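
As a rough illustration of how DSUs are derived from clustered features, the sketch below assigns each frame-level feature vector to its nearest centroid, which is the inference step of k-means clustering. The function names, the toy 2-dimensional features, and the two centroids are hypothetical; real S3M features are much higher-dimensional and the codebook is learned offline. Collapsing consecutive repeats is a common post-processing step in DSU pipelines and is included here as an assumption, not as something the abstract states.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((x - c) ** 2 for x, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def frames_to_units(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map every frame to a discrete unit id (its nearest centroid index)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Drop consecutive duplicate units (assumed post-processing step)."""
    collapsed: List[int] = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy data: hypothetical 2-D features and a 2-entry codebook.
toy_centroids = [[0.0, 0.0], [1.0, 1.0]]
toy_frames = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.1, 0.8], [0.0, 0.2]]
toy_units = frames_to_units(toy_frames, toy_centroids)
```

Because each frame is assigned independently, this step works on frames as they arrive; the streaming constraint mainly affects how the S3M computes the features that feed it.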

Problem

Research questions and friction points this paper is trying to address.

Reducing attention window for on-device streaming DSUs

Minimizing model size while preserving DSU effectiveness

Enabling real-time speech processing in resource-limited settings
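
One way to picture a reduced attention window is as a banded mask over frame positions. The sketch below is a minimal illustration under stated assumptions: the function names and the `left`/`right` context parameters are hypothetical, and this page does not specify the window shape or sizes the paper actually uses.

```python
from typing import List


def windowed_attention_mask(num_frames: int, left: int,
                            right: int = 0) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True if frame i may attend to frame j.

    Frame i sees at most `left` past frames, itself, and `right` future frames.
    """
    if num_frames < 0 or left < 0 or right < 0:
        raise ValueError("num_frames, left and right must be non-negative")
    return [
        [-left <= j - i <= right for j in range(num_frames)]
        for i in range(num_frames)
    ]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Hypothetical sizes: 5 frames, 2 frames of left context, no look-ahead.
causal_window = windowed_attention_mask(5, left=2, right=0)
full_attention = windowed_attention_mask(5, left=4, right=4)
```

With `right=0` no frame waits for future input, which is what makes frame-by-frame streaming possible. The number of allowed pairs grows linearly with the utterance length instead of quadratically, although attention is only one part of a model's total FLOPs.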

Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming discrete speech units (DSUs)

Reduced attention window and model size

50% FLOPs reduction with only a 6.5% relative CER increase on ML-SUPERB 1h
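
The CER figure is a relative increase, not an absolute one, and the distinction is easy to misread. The short sketch below works through the arithmetic. The baseline CER of 20.0% is a hypothetical value chosen for illustration; this page does not report the absolute CER of either system.

```python
def apply_relative_increase(baseline: float, relative_percent: float) -> float:
    """Return `baseline` increased by `relative_percent` percent of itself."""
    return baseline * (1.0 + relative_percent / 100.0)


def relative_change_percent(baseline: float, new: float) -> float:
    """Return the change from `baseline` to `new` as a percentage of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


BASELINE_CER = 20.0  # hypothetical baseline CER in percent, not from the paper
streaming_cer = apply_relative_increase(BASELINE_CER, 6.5)
absolute_gap = streaming_cer - BASELINE_CER
```

Under this assumed baseline, a 6.5% relative increase corresponds to a CER of about 21.3%, an absolute gap of about 1.3 percentage points rather than 6.5.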
