🤖 AI Summary
This work addresses the fundamental trade-off among latency, memory consumption, and translation quality in streaming sequence-to-sequence tasks, particularly automatic speech recognition (ASR) and simultaneous speech translation. To this end, we propose STAR, a novel streaming Transformer architecture. Its core innovations are: (i) a dynamic streaming segmentation mechanism that replaces fixed-size windows or hard truncation with learnable, adaptive segment boundaries; and (ii) anchor representation learning, jointly optimized with streaming attention masking, which efficiently compresses historical context within the Transformer framework. Experiments demonstrate that STAR achieves near-lossless 12× compression in ASR; in simultaneous speech translation it reduces average latency by 37%, cuts memory usage by 52%, and achieves a relative 8.3% reduction in word error rate (WER), substantially outperforming existing streaming approaches.
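The anchor mechanism described above can be pictured as an attention mask: queries in the current segment attend to one anchor token per completed segment plus the current segment's own frames, so history is compressed rather than kept in full. The following is a minimal illustrative sketch, not the paper's implementation; the function name, the choice of each segment's last frame as its anchor, and the mask convention are all assumptions:

```python
import numpy as np

def streaming_anchor_mask(seg_lens, cur_len):
    """Boolean attention mask for the current segment's queries.

    Past segments are represented only by a single anchor token each
    (assumed here to be the segment's final frame); the current segment
    attends to those anchors plus its own frames causally.
    True = attention allowed.
    """
    # Key layout: [seg0 frames | seg1 frames | ... | current frames]
    total_past = sum(seg_lens)
    n_keys = total_past + cur_len
    mask = np.zeros((cur_len, n_keys), dtype=bool)

    # Allow attention only to each past segment's anchor position.
    offset = 0
    for seg_len in seg_lens:
        mask[:, offset + seg_len - 1] = True
        offset += seg_len

    # Causal attention within the current segment.
    for q in range(cur_len):
        mask[q, total_past:total_past + q + 1] = True
    return mask

mask = streaming_anchor_mask(seg_lens=[4, 3], cur_len=2)
# Each query keeps 2 anchors + a causal window over the current segment,
# instead of attending to all 9 past-and-present frames.
```

With learned segmentation, `seg_lens` would come from the model's boundary predictor rather than being fixed; the memory saving comes from each past segment contributing one key instead of `seg_len` keys.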
📝 Abstract
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12×) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.