Stratos: An End-to-End Distillation Pipeline for Customized LLMs under Distributed Cloud Environments

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of manual intervention, poor customization, and suboptimal cost-efficiency in large language model (LLM) distillation within distributed cloud environments, this paper proposes an end-to-end automated distillation framework. The method jointly optimizes server selection, teacher-student model pairing, and distillation strategy based on user-specified constraints (e.g., accuracy, latency, budget), leveraging Pareto-optimal server allocation, dynamic teacher-student matching, and task-adaptive distillation. It integrates knowledge distillation, reverse synthetic data generation, and knowledge injection, guided by a resource–task complexity co-optimization algorithm. Evaluated on a rare, domain-specific Mahjong reasoning task, the distilled student model achieves four times the accuracy of its GPT-4o teacher baseline while significantly reducing inference latency and deployment cost. These results demonstrate the framework's effectiveness and practicality for efficient, domain-specific LLM customization and deployment in distributed cloud settings.
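
The summary describes the server-selection step only at a high level. As a minimal sketch of what constraint-filtered, Pareto-style server selection could look like (the `Server` fields, the tie-break rule, and all names below are illustrative assumptions, not the paper's actual algorithm):

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_ms: float     # expected inference latency on this server
    cost_per_hour: float  # hosting cost
    est_accuracy: float   # predicted student accuracy when hosted here

def pareto_front(servers):
    """Keep servers not dominated on (latency, cost, accuracy).

    A server is dominated if another server is no worse on every
    objective and strictly better on at least one.
    """
    front = []
    for s in servers:
        dominated = any(
            o.latency_ms <= s.latency_ms
            and o.cost_per_hour <= s.cost_per_hour
            and o.est_accuracy >= s.est_accuracy
            and (o.latency_ms < s.latency_ms
                 or o.cost_per_hour < s.cost_per_hour
                 or o.est_accuracy > s.est_accuracy)
            for o in servers if o is not s
        )
        if not dominated:
            front.append(s)
    return front

def select_server(servers, max_latency_ms, budget_per_hour, min_accuracy):
    """Apply the user's constraints first, then pick from the Pareto front."""
    feasible = [
        s for s in servers
        if s.latency_ms <= max_latency_ms
        and s.cost_per_hour <= budget_per_hour
        and s.est_accuracy >= min_accuracy
    ]
    front = pareto_front(feasible)
    # Tie-break by cost; the paper's actual scoring rule is not given here.
    return min(front, key=lambda s: s.cost_per_hour) if front else None
```

Filtering by the user's hard constraints before computing the Pareto front keeps every returned candidate feasible, so the final tie-break only has to choose among already-acceptable servers.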

📝 Abstract
The growing industrial demand for customized and cost-efficient large language models (LLMs) is fueled by the rise of vertical, domain-specific tasks and the need to optimize performance under constraints such as latency and budget. Knowledge distillation, as an efficient model compression and transfer technique, offers a feasible solution. However, existing distillation frameworks often require manual intervention and struggle to meet such complex user-defined distillation requirements. To bridge this gap, we propose Stratos, an end-to-end LLM distillation pipeline that automates server and model selection, knowledge distillation, and deployment in distributed cloud environments. Given user-defined constraints on model performance and system budget, Stratos automatically selects Pareto-optimal servers, dynamically matches teacher-student pairs, and adapts distillation strategies based on task complexity to optimize cloud hosting. Experiments show that Stratos produces a student model that achieves four times the accuracy of its GPT-4o teacher baseline on a rare, domain-specific Mahjong reasoning task with reverse synthetic data and knowledge injection. Moreover, it achieves reduced latency and cost without compromising accuracy. These results highlight its promise for vertical-domain LLM deployment.
Problem

Research questions and friction points this paper is trying to address.

How to automate LLM distillation for customized models in distributed clouds, without manual intervention
How to jointly optimize server selection and distillation strategy under user-defined constraints (accuracy, latency, budget)
How to raise accuracy on domain-specific tasks while reducing inference latency and deployment cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates server selection and model deployment
Dynamically matches teacher-student model pairs
Adapts distillation strategies to task complexity (see the sketch after this list)
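
As a loose sketch of how dynamic teacher-student matching and task-adaptive strategy selection might be wired together (the model names, capability scores, thresholds, and scoring rule below are illustrative assumptions, not the paper's algorithm):

```python
# Hypothetical sketch: pair a teacher with a student, then pick a
# distillation strategy from an estimated task-complexity score.
TEACHERS = [("gpt-4o", 0.92), ("qwen2.5-72b", 0.88)]        # (name, capability)
STUDENTS = [("qwen2.5-7b", 0.70), ("llama-3.2-3b", 0.55)]   # (name, capacity)

def match_pair(task_complexity, accuracy_target):
    """Cheapest (lowest-capacity) student that still meets the target,
    and the weakest teacher that still clears the task complexity;
    fall back to the most capable option if none qualifies."""
    students = sorted(STUDENTS, key=lambda s: s[1])
    student = next((s for s in students if s[1] >= accuracy_target), students[-1])
    teachers = sorted(TEACHERS, key=lambda t: t[1])
    teacher = next((t for t in teachers if t[1] >= task_complexity), teachers[-1])
    return teacher, student

def pick_strategy(task_complexity, domain_is_rare):
    """Task-adaptive choice among the three techniques the paper names."""
    if domain_is_rare:
        # Scarce domain data: synthesize training data from the teacher
        # and inject curated domain knowledge.
        return ["reverse synthetic data generation", "knowledge injection"]
    if task_complexity > 0.8:
        return ["knowledge distillation", "knowledge injection"]
    return ["knowledge distillation"]

teacher, student = match_pair(task_complexity=0.9, accuracy_target=0.6)
print(teacher, student, pick_strategy(0.9, domain_is_rare=True))
```

Choosing the smallest student and weakest teacher that still satisfy the requirements mirrors the cost-efficiency goal stated in the abstract: capacity beyond the user's constraints only adds hosting cost.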
👥 Authors
Ziming Dai, Master's student, Tianjin University (distributed learning)
Tuo Zhang, University of Southern California, Los Angeles, United States
Fei Gao, College of Intelligence and Computing, Tianjin University, Tianjin, China
Xingyi Cai, College of Intelligence and Computing, Tianjin University, Tianjin, China
Xiaofei Wang, College of Intelligence and Computing, Tianjin University, Tianjin, China
Cheng Zhang, Institute of Technology, Tianjin University of Finance and Economics, Tianjin, China
Wenyu Wang, Paiou Cloud Computing (Shanghai) Co., Ltd, Shanghai, China
Chengjie Zang, Paiou Cloud Computing (Shanghai) Co., Ltd, Shanghai, China