Collaborative Speculative Inference for Efficient LLM Inference Serving

📅 2025-03-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing speculative decoding methods suffer from low resource utilization and insufficient draft token acceptance rates, leading to scalability and performance bottlenecks. To address these issues, this paper proposes a collaborative speculative inference framework. Its core contributions are: (1) a multi-expert draft generator routing mechanism that enables node-level specialization; (2) a confidence-driven token fusion strategy that enhances verification consistency; and (3) dynamic pipeline orchestration with adaptive batch scheduling, effectively decoupling draft generation from parallel verification. The framework supports distributed deployment and heterogeneous node collaboration. Evaluated under identical hardware resources, it achieves a 23.2% reduction in end-to-end latency and a 32.5% increase in throughput compared to baseline methods, significantly outperforming state-of-the-art speculative inference approaches.


📝 Abstract
Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the efficiency of inference serving by reducing LLM inference latency and costs while preserving generation quality. However, existing speculative methods face critical challenges, including inefficient resource utilization and limited draft acceptance, which constrain their scalability and overall effectiveness. To overcome these obstacles, we present CoSine, a novel speculative inference system that decouples sequential speculative decoding from parallel verification, enabling efficient collaboration among multiple nodes. Specifically, CoSine routes inference requests to specialized drafters based on their expertise and incorporates a confidence-based token fusion mechanism to synthesize outputs from cooperating drafters, ensuring high-quality draft generation. Additionally, CoSine dynamically orchestrates the execution of speculative decoding and verification in a pipelined manner, employing batch scheduling to selectively group requests and adaptive speculation control to minimize idle periods. By optimizing parallel workflows through heterogeneous node collaboration, CoSine balances draft generation and verification throughput in real-time, thereby maximizing resource utilization. Experimental results demonstrate that CoSine achieves superior performance compared to state-of-the-art speculative approaches. Notably, with equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.
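The abstract's confidence-based token fusion can be illustrated with a minimal sketch. This is a hypothetical reconstruction under simple assumptions, not the paper's actual algorithm: each cooperating drafter proposes a draft token sequence with per-token confidences, and at each position the fused draft keeps the token whose drafter is most confident. The function name `fuse_drafts` and the data layout are illustrative inventions.

```python
# Minimal sketch of confidence-based token fusion (hypothetical; CoSine's
# exact fusion rule is not reproduced here). Each drafter contributes a
# (tokens, confidences) pair; at every position the fused draft keeps the
# token backed by the highest drafter confidence.

def fuse_drafts(drafts):
    """drafts: list of (tokens, confidences) pairs, one per drafter.
    Sequences may differ in length; fusion runs up to the shortest."""
    if not drafts:
        return []
    horizon = min(len(tokens) for tokens, _ in drafts)
    fused = []
    for pos in range(horizon):
        # Select the most confident proposal at this position.
        token, _conf = max(
            ((tokens[pos], confs[pos]) for tokens, confs in drafts),
            key=lambda tc: tc[1],
        )
        fused.append(token)
    return fused
```

For example, with two drafters that disagree only at the second position, the fused draft takes the token from whichever drafter reports higher confidence there.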
Problem

Research questions and friction points this paper is trying to address.

High LLM inference latency and serving cost
Low resource utilization and draft token acceptance rates in existing speculative methods
Limited scalability and overall inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples speculative decoding from verification
Uses confidence-based token fusion mechanism
Dynamically orchestrates pipelined execution
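The decoupling of draft generation from verification described above can be sketched as a toy producer/consumer pipeline. This is an illustrative assumption, not CoSine's scheduler: a drafter thread pushes speculated token sequences into a queue while a verifier thread consumes them and accepts the longest prefix that agrees with the target model's output. The names `drafter`, `verifier`, and `run_pipeline` are hypothetical.

```python
# Toy sketch of pipelined speculative decoding (illustrative only; CoSine's
# batch scheduling and adaptive speculation control are far more involved).
# A drafter thread produces drafts; a verifier thread consumes them and
# records how many draft tokens the target model accepts per request.
import queue
import threading

def drafter(requests, q):
    # Producer: emit (request_id, draft_tokens) pairs, then a sentinel.
    for req_id, draft in requests:
        q.put((req_id, draft))
    q.put(None)

def verifier(q, target_fn, results):
    # Consumer: accept the longest draft prefix matching the target output.
    while True:
        item = q.get()
        if item is None:
            break
        req_id, draft = item
        target = target_fn(req_id)
        accepted = 0
        for d, t in zip(draft, target):
            if d != t:
                break
            accepted += 1
        results[req_id] = accepted

def run_pipeline(requests, target_fn):
    q, results = queue.Queue(), {}
    t1 = threading.Thread(target=drafter, args=(requests, q))
    t2 = threading.Thread(target=verifier, args=(q, target_fn, results))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

Because the two stages communicate only through the queue, drafting for one request can overlap with verification of another, which is the idle-time reduction the pipelined orchestration targets.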