🤖 AI Summary
To address the challenges of manual intervention, poor customization, and suboptimal cost-efficiency in large language model (LLM) distillation within distributed cloud environments, this paper proposes an end-to-end automated distillation framework. The method jointly optimizes server selection, teacher-student model pairing, and distillation strategy based on user-specified constraints (e.g., accuracy, latency, budget), leveraging Pareto-optimal server allocation, dynamic teacher-student matching, and task-adaptive distillation. It integrates knowledge distillation, reverse synthetic data generation, and knowledge injection, guided by a resource–task complexity co-optimization algorithm. Evaluated on a Mahjong reasoning task, the distilled student model achieves four times the accuracy of its GPT-4o teacher baseline while significantly reducing inference latency and deployment costs. This demonstrates the framework’s effectiveness and practicality for efficient, domain-specific LLM customization and deployment in distributed cloud settings.
📝 Abstract
The growing industrial demand for customized and cost-efficient large language models (LLMs) is fueled by the rise of vertical, domain-specific tasks and the need to optimize performance under constraints such as latency and budget. Knowledge distillation, as an efficient model compression and transfer technique, offers a feasible solution. However, existing distillation frameworks often require manual intervention and struggle to meet such complex user-defined distillation requirements. To bridge this gap, we propose Stratos, an end-to-end LLM distillation pipeline that automates server and model selection, knowledge distillation, and deployment in distributed cloud environments. Given user-defined constraints on model performance and system budget, Stratos automatically selects Pareto-optimal servers, dynamically matches teacher-student pairs, and adapts distillation strategies based on task complexity to optimize cloud hosting. Experiments show that Stratos produces a student model that achieves four times the accuracy of its GPT-4o teacher baseline on a rare, domain-specific Mahjong reasoning task with reverse synthetic data and knowledge injection. Moreover, it reduces latency and cost without compromising accuracy. These results highlight its promise for vertical-domain LLM deployment.
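The Pareto-optimal server selection mentioned above can be illustrated with a minimal sketch. The server names, objectives (hourly cost, latency, GPU memory), and catalog values below are hypothetical, not taken from the paper; the sketch only shows the standard dominance filter such a selection step would rely on: a server stays on the Pareto front unless another server is at least as good on every objective and strictly better on one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Server:
    name: str
    cost_per_hour: float  # USD, lower is better
    latency_ms: float     # lower is better
    gpu_memory_gb: float  # higher is better

def dominates(a: Server, b: Server) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on at least one."""
    no_worse = (a.cost_per_hour <= b.cost_per_hour
                and a.latency_ms <= b.latency_ms
                and a.gpu_memory_gb >= b.gpu_memory_gb)
    strictly_better = (a.cost_per_hour < b.cost_per_hour
                       or a.latency_ms < b.latency_ms
                       or a.gpu_memory_gb > b.gpu_memory_gb)
    return no_worse and strictly_better

def pareto_front(servers: list[Server]) -> list[Server]:
    """Keep only servers that no other server dominates."""
    return [s for s in servers if not any(dominates(o, s) for o in servers)]

# Hypothetical catalog: "a100-slow" is dominated by "a100" (cheaper, faster, same memory).
catalog = [
    Server("a10-spot", 0.60, 120.0, 24.0),
    Server("a100", 2.10, 45.0, 80.0),
    Server("t4", 0.30, 200.0, 16.0),
    Server("a100-slow", 2.50, 60.0, 80.0),
]
front = pareto_front(catalog)
print([s.name for s in front])  # a100-slow is filtered out
```

In a full pipeline, user constraints (e.g., a latency ceiling or budget cap) would first filter the catalog, and the final choice would be made among the remaining Pareto-optimal servers.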