Declarative Data Pipeline for Large Scale ML Services

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the joint optimization challenges of performance, maintainability, and collaborative efficiency in large-scale integrated machine learning within distributed data processing systems, this paper proposes Pipes—a declarative, modular data pipeline architecture. Pipes decomposes pipelines into logically encapsulated computation units, implemented atop Apache Spark with standardized interfaces and well-defined component boundaries—departing from conventional microservice paradigms to enable high-performance, maintainable ML pipeline development. In enterprise deployments, Pipes improves development efficiency by 50%, reduces collaborative debugging cycles from weeks to days, achieves 500× scalability, and delivers 10× higher throughput. Academic benchmarks show >5.7× throughput improvement and 99% CPU utilization. Its core contribution is the first deep integration of declarative abstractions with Spark’s native execution model, simultaneously advancing both development methodology and system performance.

📝 Abstract
Modern distributed data processing systems face significant challenges in balancing system performance with code maintainability and developer productivity, particularly when integrating machine learning capabilities at scale. In large collaborative environments, these challenges are amplified by high communication overhead between teams and the complexity of coordinating development across multiple groups. This paper presents a novel "Declarative Data Pipeline" architecture that addresses these challenges while processing billions of records with high accuracy and efficiency. Our architecture introduces a modular framework that seamlessly integrates machine learning capabilities within Apache Spark by composing logical computation units that we refer to as Pipes, departing from traditional microservice-based approaches. By establishing clear component boundaries and standardized interfaces, we achieve both modularity and system optimization without sacrificing maintainability. An enterprise case study demonstrates substantial improvements across multiple dimensions: development efficiency improved by 50%, collaboration and troubleshooting efforts were compressed from weeks to days, scalability improved by 500x, and throughput by 10x. An academic experiment further shows at least 5.7x higher throughput with 99% CPU utilization compared to non-framework implementations. This paper details the architectural decisions, implementation strategies, and performance optimizations that enable these improvements, providing insights for building scalable, maintainable data processing systems that effectively balance system performance with development velocity.
Problem

Research questions and friction points this paper is trying to address.

Balancing performance with maintainability in large-scale ML services
Reducing communication overhead in collaborative data processing environments
Integrating machine learning capabilities efficiently within Apache Spark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Declarative modular framework integrating ML with Spark
Pipes-based architecture replacing traditional microservice approaches
Standardized interfaces enabling optimization without sacrificing maintainability
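To make the idea concrete, here is a minimal sketch of the declarative, modular composition pattern the paper describes. This is plain Python, not the paper's actual Spark-based API; the names `Pipe`, `Pipeline`, `Tokenize`, and `FilterShort` are hypothetical, and lists of dicts stand in for distributed DataFrames:

```python
from abc import ABC, abstractmethod


class Pipe(ABC):
    """A logically encapsulated computation unit with a standardized interface."""

    @abstractmethod
    def apply(self, records):
        """Transform a batch of records and return the result."""


class Tokenize(Pipe):
    """Example unit: split each record's text into tokens."""

    def apply(self, records):
        return [{**r, "tokens": r["text"].split()} for r in records]


class FilterShort(Pipe):
    """Example unit: drop records with fewer than `min_tokens` tokens."""

    def __init__(self, min_tokens):
        self.min_tokens = min_tokens

    def apply(self, records):
        return [r for r in records if len(r["tokens"]) >= self.min_tokens]


class Pipeline:
    """Declarative composition: units are declared up front, then executed in order."""

    def __init__(self, *pipes):
        self.pipes = pipes

    def run(self, records):
        for pipe in self.pipes:
            records = pipe.apply(records)
        return records


# The pipeline is declared as data (a sequence of Pipes), not as imperative glue code.
pipeline = Pipeline(Tokenize(), FilterShort(min_tokens=2))
out = pipeline.run([{"text": "hello world"}, {"text": "hi"}])
# out keeps only the two-token record
```

Because each unit only sees the standardized `apply` boundary, teams can own individual Pipes independently, and an execution engine (Spark, in the paper's case) is free to optimize across the declared sequence.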
Yunzhao Yang
Amazon Web Services, Seattle, USA
Runhui Wang
Rutgers University
Natural Language Processing · Entity Resolution · Data Mining
Xuanqing Liu
Amazon Web Services, Seattle, USA
Adit Krishnan
Applied Scientist, Amazon
robust_ml · information search and retrieval · multimodal information retrieval · multimodal data mining
Yefan Tao
Amazon Web Services, Seattle, USA
Yuqian Deng
Amazon Web Services, Seattle, USA
Kuangyou Yao
Amazon Web Services, Seattle, USA
Peiyuan Sun
Amazon Web Services, Seattle, USA
Henrik Johnson
Amazon Web Services, Seattle, USA
Aditi Sinha
Amazon Web Services, Seattle, USA
Davor Golac
Amazon Web Services, Seattle, USA
Gerald Friedland
Faculty UC Berkeley, Principal Scientist Amazon AWS
multimedia computing · AutoML · speaker diarization · privacy
Usman Shakeel
Amazon Web Services, Seattle, USA
Daryl Cooke
Amazon Web Services, Seattle, USA
Joe Sullivan
Amazon Web Services, Seattle, USA
Chris Kong
Principal Scientist @ AWS
NLP