RelayLLM: Efficient Reasoning via Collaborative Decoding

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the high inference cost and latency of large language models alongside the limited capabilities of small models, noting that existing collaborative approaches operate at coarse granularity and often waste computational resources. The authors propose a token-level collaborative decoding framework in which a small model dynamically invokes a large model only at critical tokens, introducing a novel “relay” mechanism for fine-grained control. A two-stage training strategy—comprising a warm-up phase followed by Group Relative Policy Optimization (GRPO)—teaches the small model to balance autonomous generation with strategic help-seeking. Experiments show that the method achieves an average accuracy of 49.52% across six benchmarks, with only 1.07% of generated tokens requiring large-model intervention, reducing inference costs by 98.2% compared to performance-matched random routers.

📝 Abstract
Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, comprising warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Small Language Models
Efficient Reasoning
Collaborative Decoding
Computational Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

collaborative decoding
token-level routing
small language models
Group Relative Policy Optimization
efficient reasoning