🤖 AI Summary
To address the performance bottlenecks of private small language models (SLMs) in resource-constrained settings, this paper proposes a process-reward-driven dynamic collaborative inference framework. The method avoids fine-tuning general-purpose large language models (LLMs) and never exposes private data, instead enabling lightweight, controllable joint inference between SLMs and LLMs via four key components: (i) modeling of step-wise process rewards, (ii) adaptive collaborative decoding, (iii) API-aware scheduling, and (iv) heterogeneous architecture design. Evaluated across multiple benchmark tasks, the private SLM achieves substantial performance gains—matching or even surpassing the standalone performance of general-purpose LLMs—while reducing inference cost by over 30%. The core contribution is the first introduction of a process reward mechanism into SLM–LLM collaborative inference, balancing efficiency, privacy preservation, and inference controllability.
📝 Abstract
Due to limited computational resources, most developers cannot fine-tune Large Language Models (LLMs) and can only fine-tune Small Language Models (SLMs) on their own data. These private SLMs typically have limited effectiveness. To boost the performance of private SLMs, this paper proposes to ask general LLMs for help. The general LLMs can be APIs or larger LLMs whose inference cost the developers can afford. Specifically, we propose the G-Boost framework, in which a private SLM adaptively performs collaborative inference with a general LLM under the guidance of a process reward. Experiments demonstrate that our framework can significantly boost the performance of private SLMs.
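To make the collaboration described above concrete, the following is a minimal sketch of process-reward-guided collaborative decoding. Everything here is an assumption for illustration, not the paper's actual implementation: `slm_step`, `llm_step`, and `process_reward` are hypothetical stand-ins for the private SLM, the general LLM (e.g. an API), and a trained process reward model (PRM), and `llm_budget` is a hypothetical knob standing in for cost-aware scheduling of expensive LLM calls.

```python
def slm_step(context):
    """Stand-in for the private SLM proposing the next reasoning step."""
    return context + " [slm-step]"

def llm_step(context):
    """Stand-in for the general LLM proposing the next reasoning step."""
    return context + " [llm-step]"

def process_reward(candidate):
    """Stand-in PRM scoring a partial solution step-wise.
    Here it trivially prefers LLM-produced steps, purely for illustration."""
    return candidate.count("[llm-step]")

def collaborative_decode(question, max_steps=3, llm_budget=1):
    """At each reasoning step, take the SLM's proposal by default, but switch
    to the (costlier) general LLM whenever the PRM scores its proposal higher
    and the LLM-call budget is not yet exhausted."""
    context, llm_calls = question, 0
    for _ in range(max_steps):
        slm_cand = slm_step(context)
        if llm_calls < llm_budget:
            llm_cand = llm_step(context)
            if process_reward(llm_cand) > process_reward(slm_cand):
                context, llm_calls = llm_cand, llm_calls + 1
                continue
        context = slm_cand
    return context, llm_calls
```

In this toy run, only one of the three steps is delegated to the LLM, which is the efficiency argument in miniature: the PRM decides step by step when the general model is actually worth its inference cost.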