VoxServe: Streaming-Centric Serving System for Speech Language Models

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing systems struggle to serve diverse speech language models (SLMs) efficiently in streaming scenarios while simultaneously achieving low latency, high throughput, and strong streamability. This work proposes a unified streaming serving framework built on a model-execution abstraction layer that decouples model architecture from system-level optimization, enabling, for the first time, flexible support for multiple SLM architectures within a single system. The framework further incorporates streaming-aware scheduling and an asynchronous inference pipeline for end-to-end efficiency. Experiments show that the proposed approach achieves 10–20× higher throughput than existing systems across several mainstream SLMs while maintaining comparable latency and high streaming viability.
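To make the summary's "model-execution decoupling" concrete, here is a minimal sketch of what such an abstraction layer could look like. This is not VoxServe's actual API; the `SpeechLMAdapter` interface, `StepOutput` type, and `EchoAdapter` stand-in model are all hypothetical, illustrating only the idea that each SLM architecture implements one uniform step interface so the scheduler needs no architecture-specific branches.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class StepOutput:
    """Tokens produced by one decode step, plus a stream-completion flag."""
    audio_tokens: list[int] = field(default_factory=list)
    finished: bool = False


class SpeechLMAdapter(ABC):
    """Hypothetical model-execution abstraction: every SpeechLM architecture
    implements the same prefill/step interface, so system-level components
    (scheduler, pipeline) can drive any model uniformly."""

    @abstractmethod
    def prefill(self, request_id: str, prompt_tokens: list[int]) -> None:
        """Ingest the prompt and initialize per-request state (e.g. KV cache)."""

    @abstractmethod
    def decode_step(self, request_id: str) -> StepOutput:
        """Advance one autoregressive step and emit any ready audio tokens."""


class EchoAdapter(SpeechLMAdapter):
    """Toy adapter that replays prompt tokens one per step (stand-in model)."""

    def __init__(self) -> None:
        self._state: dict[str, list[int]] = {}

    def prefill(self, request_id: str, prompt_tokens: list[int]) -> None:
        self._state[request_id] = list(prompt_tokens)

    def decode_step(self, request_id: str) -> StepOutput:
        pending = self._state[request_id]
        if not pending:
            return StepOutput(finished=True)
        return StepOutput(audio_tokens=[pending.pop(0)])
```

With this separation, streaming-aware scheduling and the asynchronous pipeline can be written once against `SpeechLMAdapter` rather than per model family.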

📝 Abstract
Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox-serve/vox-serve.
Problem

Research questions and friction points this paper is trying to address.

Speech Language Models
streaming serving
low latency
high throughput
streamability
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming-centric serving
model-execution abstraction
asynchronous inference pipeline
speech language models
streaming-aware scheduling