Tiny Models are the Computational Saver for Large Models

📅 2024-03-26
🏛️ European Conference on Computer Vision
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address excessive computational overhead in large language model (LLM) inference, this work proposes a collaborative inference paradigm that, for the first time, employs a lightweight Tiny model as a real-time computation offloader dynamically coordinated with an LLM. The approach integrates three components: (1) model-aware collaborative scheduling, (2) knowledge-distillation-guided lightweight architecture design, and (3) dynamic computational load splitting. The three are jointly optimized to redistribute inference workload while preserving accuracy. The core contribution is a Pareto-optimal trade-off between accuracy and efficiency: on mainstream LLM inference tasks, the method reduces GPU memory consumption by 47% and end-to-end latency by 39%, with accuracy degradation bounded within 0.8%. The framework offers a scalable, cost-effective pathway for practical LLM deployment.
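The summary does not spell out how work is split between the two models. As a minimal sketch only, the idea of a tiny model offloading computation from a large one can be illustrated with a confidence-threshold rule: the tiny model answers when it is confident, and the input is deferred to the large model otherwise. The function `route`, the `threshold` parameter, and the stub models below are hypothetical names introduced for illustration, not the paper's implementation.

```python
from typing import Callable, List, Tuple

# A "model" here is any callable that maps an input to class probabilities.
Model = Callable[[object], List[float]]


def argmax(probs: List[float]) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def route(x: object, tiny: Model, large: Model,
          threshold: float = 0.9) -> Tuple[int, str]:
    """Return (predicted class, which model answered).

    Assumption: the tiny model's top-class probability is used as the
    confidence signal; the summary does not state the actual criterion.
    """
    tiny_probs = tiny(x)
    if max(tiny_probs) >= threshold:
        # Confident enough: skip the large model entirely.
        return argmax(tiny_probs), "tiny"
    # Otherwise pay the full cost and defer to the large model.
    return argmax(large(x)), "large"


# Stub models standing in for real networks (illustrative only).
def tiny_stub(x: object) -> List[float]:
    return [0.95, 0.05] if x == "easy" else [0.55, 0.45]


def large_stub(x: object) -> List[float]:
    return [0.10, 0.90]
```

Under this rule, raising `threshold` sends more inputs to the large model (higher cost, typically higher accuracy), while lowering it lets the tiny model handle more of the workload.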

Problem

Research questions and friction points this paper is trying to address.

AI model optimization
computational resource reduction
performance maintenance
Innovation

Methods, ideas, or system contributions that make the work stand out.

TinySaver
Efficient Computing
Model Efficiency
Qingyuan Wang
University College Dublin, Ireland
B. Cardiff
University College Dublin, Ireland
Antoine Frappé
Univ. Lille, CNRS, Centrale Lille, Junia, Univ. Polytechnique Hauts-de-France, UMR 8520-IEMN, France
Benoît Larras
Univ. Lille, CNRS, Centrale Lille, Junia, Univ. Polytechnique Hauts-de-France, UMR 8520-IEMN, France
Deepu John
University College Dublin
Edge Computing, IoT, Wearable Sensing, Biomedical Circuits and Systems