Cost-Performance Analysis: A Comparative Study of CPU-Based Serverless and GPU-Based Training Architectures

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of balancing scalability and cost-efficiency in distributed training of large models, this paper builds on SPIRT, a distributed machine learning framework on a serverless CPU architecture. SPIRT combines parallel batch processing with in-database computation via RedisAI, jointly reducing communication overhead and improving computational scheduling, and includes a dynamic fault-tolerance mechanism. The paper compares SPIRT against established architectures, including ScatterReduce, AllReduce, and MLLess, on training time, cost-effectiveness, communication overhead, and fault tolerance. Experiments show that SPIRT significantly reduces communication volume and training time relative to GPU-based clusters and baselines such as MLLess, achieves lower total cost for long-running training jobs despite higher initial setup costs, and recovers from faults more robustly, supporting CPU-based serverless infrastructure as an economically efficient paradigm for next-generation distributed training systems.
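The ScatterReduce pattern mentioned above can be illustrated with a minimal sketch: each of P peers takes ownership of reducing one shard of the gradient vector, then the reduced shards are gathered back, so per-peer traffic scales with N/P rather than N. This is an illustrative simulation only; the function and variable names are not SPIRT's actual API.

```python
def scatter_reduce(grads):
    """Simulate a ScatterReduce round.

    grads: list of per-peer gradient vectors, all the same length.
    Returns the fully reduced (summed) vector as seen by every peer.
    Assumes the vector length is divisible by the peer count, for brevity.
    """
    p = len(grads)
    n = len(grads[0])
    chunk = n // p

    # Phase 1 (scatter-reduce): peer i sums everyone's i-th shard.
    reduced_shards = []
    for i in range(p):
        lo, hi = i * chunk, (i + 1) * chunk
        shard = [0.0] * (hi - lo)
        for g in grads:
            for j in range(lo, hi):
                shard[j - lo] += g[j]
        reduced_shards.append(shard)

    # Phase 2 (all-gather): concatenate reduced shards into the full sum.
    full = [x for shard in reduced_shards for x in shard]
    return [full[:] for _ in range(p)]  # every peer ends with the same sum
```

For example, with two peers holding `[1, 2, 3, 4]` and `[5, 6, 7, 8]`, each peer ends up with the elementwise sum `[6, 8, 10, 12]`, having exchanged only half the vector in each phase.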

📝 Abstract
The field of distributed machine learning (ML) faces increasing demands for scalable and cost-effective training solutions, particularly in the context of large, complex models. Serverless computing has emerged as a promising paradigm to address these challenges by offering dynamic scalability and resource-efficient execution. Building upon our previous work, which introduced the Serverless Peer Integrated for Robust Training (SPIRT) architecture, this paper presents a comparative analysis of several serverless distributed ML architectures. We examine SPIRT alongside established architectures like ScatterReduce, AllReduce, and MLLess, focusing on key metrics such as training time efficiency, cost-effectiveness, communication overhead, and fault tolerance capabilities. Our findings reveal that SPIRT provides significant improvements in reducing training times and communication overhead through strategies such as parallel batch processing and in-database operations facilitated by RedisAI. However, traditional architectures exhibit scalability challenges and varying degrees of vulnerability to faults and adversarial attacks. The cost analysis underscores the long-term economic benefits of SPIRT despite its higher initial setup costs. This study not only highlights the strengths and limitations of current serverless ML architectures but also sets the stage for future research aimed at developing new models that combine the most effective features of existing systems.
Problem

Research questions and friction points this paper is trying to address.

Comparing cost-effectiveness of serverless and GPU training architectures
Analyzing communication overhead and fault tolerance in ML systems
Evaluating scalability challenges in distributed machine learning solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Serverless Peer Integrated for Robust Training architecture
Parallel batch processing with RedisAI operations
Comparative analysis of cost-performance training architectures
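The in-database operations listed above refer to running the reduction next to the stored tensors (via RedisAI), so each peer downloads one aggregated gradient instead of every peer's raw update. A pure-Python stand-in for the reduction such a script might perform is sketched below; the function name and semantics are illustrative assumptions, not SPIRT's actual RedisAI script.

```python
def average_gradients(grad_tensors):
    """Stand-in for an in-database reduction: average per-peer gradient
    vectors so each peer fetches one vector rather than P of them."""
    p = len(grad_tensors)
    n = len(grad_tensors[0])
    return [sum(g[i] for g in grad_tensors) / p for i in range(n)]
```

Colocating this step with the tensor store is what cuts communication overhead: the P-to-1 aggregation happens server-side, and only the averaged result crosses the network.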
Amine Barrak
Oakland University, Rochester, MI, USA
Fabio Petrillo
Associate Professor, École de Technologie Supérieure (ÉTS)
Software Engineering · Debugging · Software Quality · Software Architecture · Video Game Development
Fehmi Jaafar
University of Quebec at Chicoutimi, Saguenay, Canada