🤖 AI Summary
To reduce the computational overhead of large language model (LLM) inference, this work proposes a collaborative inference paradigm that, for the first time, employs a lightweight Tiny model as a real-time computation offloader dynamically coordinated with an LLM. Our approach integrates three key components: (1) model-aware collaborative scheduling, (2) knowledge-distillation-guided lightweight architecture design, and (3) dynamic computational load splitting. These components are jointly optimized to redistribute inference workload while preserving accuracy. The core contribution is a Pareto-optimal trade-off between accuracy and efficiency: on mainstream LLM inference tasks, our method reduces GPU memory consumption by 47% and end-to-end latency by 39%, with accuracy degradation bounded within 0.8%. This framework establishes a scalable, cost-effective pathway for practical LLM deployment.