🤖 AI Summary
This work addresses the high inference cost and latency of large language models (LLMs) and the limited reasoning capacity of small language models (SLMs), noting that existing collaborative approaches operate at coarse granularity and waste computation by offloading entire queries. The authors propose RelayLLM, a token-level collaborative decoding framework in which the small model acts as a controller and invokes the large model only at critical tokens via a special command, "relaying" the generation process. A two-stage training strategy, a warm-up phase followed by Group Relative Policy Optimization (GRPO), teaches the small model to balance autonomous generation with strategic help-seeking. Experiments show that the method achieves an average accuracy of 49.52% across six benchmarks while invoking the large model for only 1.07% of generated tokens, a 98.2% cost reduction compared to performance-matched random routers.
📝 Abstract
The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, consisting of a warm-up stage and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
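To make the token-level relay idea concrete, here is a minimal sketch of a decoding loop in which the small model stays in control and hands over only when it emits a command token. It is an illustration under stated assumptions, not the authors' implementation: the token names `<call>` and `<eos>`, the fixed `relay_len` hand-over length, the function `relay_decode`, and the toy stand-in models are all hypothetical, since the abstract does not specify the command's format or how many tokens the LLM produces per invocation.

```python
from typing import Callable, List, Tuple

CALL = "<call>"  # hypothetical command token; the abstract does not give its form
EOS = "<eos>"    # hypothetical end-of-sequence token

# A "model" here is any function mapping the context so far to the next token.
NextToken = Callable[[List[str]], str]


def relay_decode(slm_next: NextToken, llm_next: NextToken, prompt: List[str],
                 relay_len: int = 1, max_tokens: int = 64) -> Tuple[List[str], float]:
    """Decode with the SLM, relaying to the LLM whenever the SLM emits CALL.

    Returns the generated tokens and the fraction of them produced by the LLM.
    The CALL token itself is not counted as output (an assumption of this sketch).
    """
    if relay_len < 1:
        raise ValueError("relay_len must be at least 1")
    context = list(prompt)
    output: List[str] = []
    llm_count = 0
    done = False
    while not done and len(output) < max_tokens:
        token = slm_next(context)
        if token == CALL:
            # The SLM asked for help: the LLM writes the next relay_len tokens.
            for _ in range(relay_len):
                token = llm_next(context)
                context.append(token)
                output.append(token)
                llm_count += 1
                if token == EOS or len(output) >= max_tokens:
                    done = True
                    break
        else:
            # The SLM is confident enough to continue on its own.
            context.append(token)
            output.append(token)
            done = token == EOS
    ratio = llm_count / len(output) if output else 0.0
    return output, ratio


# Toy stand-ins: the SLM follows a fixed script and asks for help at one step.
def toy_slm(context: List[str]) -> str:
    script = {0: "2", 1: "+", 2: "2", 3: "=", 4: CALL}
    return script.get(len(context), EOS)


def toy_llm(context: List[str]) -> str:
    return "4"


tokens, llm_ratio = relay_decode(toy_slm, toy_llm, prompt=[], relay_len=1)
print(tokens, llm_ratio)
```

The returned ratio is the toy analogue of the LLM token share that the abstract reports as 1.07%. In a real system the two callables would wrap actual model forward passes, and the hand-over policy would be learned during training rather than scripted.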