ACE-Sync: An Adaptive Cloud-Edge Synchronization Framework for Communication-Efficient Large-Scale Distributed Model Training

📅 2025-12-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address high communication overhead and the difficulty of balancing convergence stability and accuracy in distributed large-model training under cloud-edge collaboration, this paper proposes ACE-Sync, an adaptive synchronization framework. Methodologically, it introduces: (1) an attention-driven gradient importance predictor for fine-grained parameter importance modeling; (2) a bandwidth-aware hierarchical synchronization mechanism that dynamically schedules synchronization granularity and compression intensity via knapsack optimization; and (3) a heterogeneous quantization/sparsification strategy integrating residual error compensation and device clustering. Experiments in representative cloud-edge heterogeneous environments demonstrate that ACE-Sync reduces total communication volume by 60% (from 112.5 GB to 44.7 GB), shortens convergence from 41 to 39 epochs (roughly a 5% reduction), and achieves a Top-1 accuracy of 82.1%, only 0.3 percentage points below the full-synchronization baseline, thereby significantly improving communication efficiency and system scalability.
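The summary describes the hierarchical scheduler as solving a knapsack problem over parameter groups under per-device bandwidth budgets, but the exact formulation is not given here. As a minimal sketch, the Python snippet below approximates that selection with a greedy importance-per-byte heuristic; the ParamGroup fields, the select_groups helper, and the example sizes are illustrative assumptions, not the authors' implementation.

    from dataclasses import dataclass

    @dataclass
    class ParamGroup:
        name: str
        importance: float   # predicted gradient importance for this group
        size_bytes: int     # bytes needed to synchronize the group this round

    def select_groups(groups, budget_bytes):
        """Greedy knapsack heuristic: pick groups by importance-per-byte until the budget is spent."""
        chosen, remaining = [], budget_bytes
        for g in sorted(groups, key=lambda g: g.importance / g.size_bytes, reverse=True):
            if g.size_bytes <= remaining:
                chosen.append(g.name)
                remaining -= g.size_bytes
        return chosen

    groups = [
        ParamGroup("embedding", importance=0.9, size_bytes=40_000_000),
        ParamGroup("attention", importance=0.7, size_bytes=25_000_000),
        ParamGroup("mlp",       importance=0.4, size_bytes=30_000_000),
    ]
    print(select_groups(groups, budget_bytes=60_000_000))   # -> ['attention', 'mlp']

An exact 0/1 knapsack solver could replace the greedy loop when the number of parameter groups is small enough for dynamic programming; the greedy ratio rule is simply a common low-overhead approximation.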

📝 Abstract
Large-scale deep learning models impose substantial communication overhead in distributed training, particularly in bandwidth-constrained or heterogeneous cloud-edge environments. Conventional synchronous or fixed-compression techniques often struggle to balance communication cost, convergence stability, and model accuracy. To address these challenges, we propose ACE-Sync, an Adaptive Cloud-Edge Synchronization Framework that integrates (1) an attention-based gradient importance predictor, (2) a differentiated parameter compression strategy, and (3) a hierarchical cloud-edge coordination mechanism. ACE-Sync dynamically selects which parameter groups to synchronize and determines appropriate compression levels under per-device bandwidth budgets. A knapsack-based optimization strategy is adopted to maximize important gradient preservation while reducing redundant communication. Furthermore, residual-based error compensation and device clustering ensure long-term convergence and cross-device personalization. Experiments show that ACE-Sync substantially reduces communication overhead while maintaining competitive accuracy. Compared with FullSync, ACE-Sync lowers communication cost from 112.5 GB to 44.7 GB (a 60% reduction) and shortens convergence from 41 to 39 epochs. Despite aggressive communication reduction, ACE-Sync preserves high model quality, achieving 82.1% Top-1 accuracy, only 0.3% below the full-synchronization baseline, demonstrating its efficiency and scalability for large-scale distributed training. These results indicate that ACE-Sync provides a scalable, communication-efficient, and accuracy-preserving solution for large-scale cloud-edge distributed model training.
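The abstract couples aggressive compression with residual-based error compensation, so information dropped in one round is re-injected into a later one. Below is a minimal sketch of that standard error-feedback pattern, assuming top-k sparsification as the compression operator; the helper names, vector size, and choice of k are hypothetical and not taken from the paper.

    import numpy as np

    def topk_sparsify(vec, k):
        """Keep the k largest-magnitude entries of vec and zero the rest."""
        idx = np.argpartition(np.abs(vec), -k)[-k:]
        out = np.zeros_like(vec)
        out[idx] = vec[idx]
        return out

    def compress_with_error_feedback(grad, residual, k):
        """Re-inject the locally stored residual, compress, and keep the dropped part for next round."""
        corrected = grad + residual
        sent = topk_sparsify(corrected, k)
        return sent, corrected - sent

    rng = np.random.default_rng(0)
    residual = np.zeros(8)
    for step in range(3):
        grad = rng.standard_normal(8)
        sent, residual = compress_with_error_feedback(grad, residual, k=2)
        # 'sent' would be transmitted to the aggregator; 'residual' stays on the device

Because the residual never leaves the device, this compensation adds no communication; it only costs per-device memory proportional to the compressed parameter groups.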
Problem

Research questions and friction points this paper is trying to address.

Reduces communication overhead in distributed training for large-scale models
Balances communication cost with model accuracy and convergence stability
Optimizes synchronization in bandwidth-constrained cloud-edge environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-based gradient importance predictor for adaptive synchronization (see the sketch after this list)
Differentiated parameter compression strategy under bandwidth budgets
Hierarchical cloud-edge coordination with knapsack optimization
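The attention-based gradient importance predictor referenced above is not specified in this summary, so the sketch below is a deliberately generic stand-in: it assumes the predictor maps per-group gradient statistics to normalized importance weights through a softmax. The function name, feature choice (current gradient norms scored against a running average), and example values are all hypothetical.

    import numpy as np

    def importance_scores(current_norms, ema_norms, temperature=1.0):
        """Softmax over a similarity between current gradient norms and their running history,
        loosely mimicking an attention weighting across parameter groups."""
        logits = np.asarray(current_norms) * np.asarray(ema_norms) / temperature
        logits -= logits.max()               # shift for numerical stability
        weights = np.exp(logits)
        return weights / weights.sum()

    current = [1.8, 0.6, 1.1]   # per-group gradient norms this round
    history = [1.5, 0.9, 1.0]   # exponential moving average of past norms
    print(importance_scores(current, history))

The resulting weights could serve as the importance values fed into the budgeted group-selection step sketched earlier.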
Yi Yang
Sichuan Agricultural University, China
Ziyu Lin
National University of Singapore, Singapore Management University
Network Security/Web Security/System Security
Liesheng Wei
College of Information Technology, Shanghai Ocean University, Shanghai, China