🤖 AI Summary
Deploying large language models (LLMs) in cloud environments raises problems of privacy leakage, high latency, and resource constraints. Centralized customization is costly, and user and data heterogeneity lead to imbalanced and suboptimal performance. To address these issues, this paper proposes ACME, a bidirectional single-loop distributed framework for adaptive model customization. The framework matches model architecture to both device constraints and data distribution by combining three components: Pareto-optimal model selection, data-distribution-aware personalized head generation, and distributed architecture aggregation. Together these enable fine-grained personalized model customization with low communication overhead. Experiments show that the method reduces model transmission volume to 6% of that required by centralized approaches, improves average accuracy by 10% over the baseline, and raises the overall trade-off metric by 29.7%.
📝 Abstract
Pre-trained Transformer-based large models have revolutionized personal virtual assistants, but their deployment in cloud environments faces challenges related to data privacy and response latency. Deploying large models closer to the data and users has become a key research area to address these issues. However, applying these models directly often entails significant difficulties, such as model mismatch, resource constraints, and energy inefficiency. Automated design of customized models is necessary, but it faces three key challenges: the high cost of centralized model customization, imbalanced performance caused by user heterogeneity, and suboptimal performance caused by data heterogeneity. In this paper, we propose ACME, an adaptive customization approach for Transformer-based large models via distributed systems. To avoid the low cost-efficiency of centralized methods, ACME employs a bidirectional single-loop distributed system to progressively achieve fine-grained collaborative model customization. To better match user heterogeneity, it first customizes backbone generation and identifies the Pareto front under model size constraints to ensure optimal resource utilization. It then performs header generation and refines the model using data-distribution-based personalized architecture aggregation to match data heterogeneity. Evaluation on different datasets shows that ACME achieves cost-efficient models under model size constraints. Compared to centralized systems, data transmission volume is reduced to 6 percent. Additionally, the average accuracy improves by 10 percent compared to the baseline, with the trade-off metrics increasing by nearly 30 percent.
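The idea of identifying a Pareto front under a model size constraint can be illustrated with a minimal sketch. It is not ACME's actual procedure, which the abstract does not specify. The `Candidate` type, the `pareto_front` function, and the two objectives (size in MB and accuracy) are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical backbone candidate (illustrative, not from the paper)."""
    name: str
    size_mb: float   # smaller is better
    accuracy: float  # larger is better


def pareto_front(candidates: List[Candidate], max_size_mb: float) -> List[Candidate]:
    """Return the non-dominated candidates that fit within the size budget.

    A candidate is dominated if another one is no larger and no less
    accurate, and strictly better in at least one of the two objectives.
    """
    # Keep only candidates that satisfy the size constraint.
    feasible = [c for c in candidates if c.size_mb <= max_size_mb]
    # Sort by size ascending; for equal sizes, put the most accurate first.
    feasible.sort(key=lambda c: (c.size_mb, -c.accuracy))

    front: List[Candidate] = []
    best_accuracy = float("-inf")
    for c in feasible:
        # A larger model stays on the front only if it is strictly more
        # accurate than every smaller-or-equal model seen so far.
        # Exact duplicates of a front member are skipped.
        if c.accuracy > best_accuracy:
            front.append(c)
            best_accuracy = c.accuracy
    return front


# Made-up example values, chosen only to exercise the function.
candidates = [
    Candidate("A", 50.0, 0.70),
    Candidate("B", 80.0, 0.78),
    Candidate("C", 90.0, 0.75),   # dominated by B (larger and less accurate)
    Candidate("D", 120.0, 0.85),
    Candidate("E", 200.0, 0.90),  # exceeds the 150 MB budget below
]
selected = pareto_front(candidates, max_size_mb=150.0)
print([c.name for c in selected])
```

In this sketch each device would choose from `selected` according to its own resource budget, so that no chosen model is both larger and less accurate than an available alternative.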