🤖 AI Summary
This work addresses the fundamental trade-off among latency, memory consumption, and translation quality in streaming sequence-to-sequence tasks, particularly automatic speech recognition (ASR) and simultaneous speech translation. To this end, we propose STAR, a novel streaming Transformer architecture. Its core innovations are: (i) a dynamic streaming segmentation mechanism that replaces fixed-size windows or hard truncation with learnable, adaptive segment boundaries; and (ii) anchor representation learning, jointly optimized with streaming attention masking, which efficiently compresses historical context within the Transformer framework. Experiments demonstrate that STAR achieves near-lossless 12× compression in ASR; in simultaneous speech translation it reduces average latency by 37%, cuts memory usage by 52%, and achieves a relative 8.3% reduction in word error rate (WER), substantially outperforming existing streaming approaches.
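The anchor mechanism described above can be pictured as an attention mask: queries in the current segment attend to one anchor token per completed segment plus the current segment's own frames, so history is compressed rather than kept in full. The following is a minimal illustrative sketch, not the paper's implementation; the function name, the choice of each segment's last frame as its anchor, and the mask convention are all assumptions:

```python
import numpy as np

def streaming_anchor_mask(seg_lens, cur_len):
    """Boolean attention mask for the current segment's queries.

    Past segments are represented only by a single anchor token each
    (assumed here to be the segment's final frame); the current segment
    attends to those anchors plus its own frames causally.
    True = attention allowed.
    """
    # Key layout: [seg0 frames | seg1 frames | ... | current frames]
    total_past = sum(seg_lens)
    n_keys = total_past + cur_len
    mask = np.zeros((cur_len, n_keys), dtype=bool)

    # Allow attention only to each past segment's anchor position.
    offset = 0
    for seg_len in seg_lens:
        mask[:, offset + seg_len - 1] = True
        offset += seg_len

    # Causal attention within the current segment.
    for q in range(cur_len):
        mask[q, total_past:total_past + q + 1] = True
    return mask

mask = streaming_anchor_mask(seg_lens=[4, 3], cur_len=2)
# Each query keeps 2 anchors + a causal window over the current segment,
# instead of attending to all 9 past-and-present frames.
```

With learned segmentation, `seg_lens` would come from the model's boundary predictor rather than being fixed; the memory saving comes from each past segment contributing one key instead of `seg_len` keys.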
📝 Abstract
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12×) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.