PDSP-Bench: A Benchmarking System for Parallel and Distributed Stream Processing

📅 2025-04-14

🤖 AI Summary

Problem: Existing methodologies for evaluating parallel stream processing systems in distributed heterogeneous environments are not systematic, particularly in modeling operator-level parallelism alongside heterogeneous resource constraints. Method: We propose the first benchmarking framework that jointly models operator-level parallelism and heterogeneous resources. Built upon Apache Flink, it supports dynamic ML-driven workload generation, multi-granularity resource monitoring, pluggable cost model training, and declarative specification of parallel topologies. Contribution/Results: First, we quantitatively characterize nonlinear scaling effects and performance paradoxes in heterogeneous stream processing, phenomena that were previously unmeasured. Second, we introduce a nonlinear performance attribution analysis mechanism to isolate root causes of inefficiency. Third, we empirically evaluate multiple learning-based cost models, comparing their training efficiency and prediction accuracy. Experimental results demonstrate the framework’s effectiveness in modeling and optimizing modern large-scale streaming workloads under realistic heterogeneous deployments.

📝 Abstract

The paper introduces PDSP-Bench, a novel benchmarking system designed for a systematic understanding of the performance of parallel stream processing in a distributed environment. Such an understanding is essential for determining how Stream Processing Systems (SPS) use operator parallelism and the available resources to process the massive workloads of modern applications. Existing benchmarking systems focus on analyzing SPS using queries with sequential operator pipelines within a homogeneous, centralized environment. In contrast, PDSP-Bench emphasizes parallel stream processing in a distributed heterogeneous environment and also allows the integration of machine learning models for SPS workloads. In our results, we benchmark a well-known SPS, Apache Flink, using parallel query structures derived from real-world applications and synthetic queries to show the capabilities of PDSP-Bench for parallel stream processing. Moreover, we compare different learned cost models using SPS workloads generated on PDSP-Bench, evaluating their model and training efficiency. We present key observations from our experiments with PDSP-Bench that highlight interesting trends across different query workloads, such as non-linear and paradoxical effects of parallelism on performance.
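To make the idea of non-linear and paradoxical effects of parallelism concrete, here is a minimal sketch of how such effects could be detected from throughput measurements. It is not PDSP-Bench's method or API. The function name `scaling_report` and the throughput numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Dict, List, Tuple


def scaling_report(throughput: Dict[int, float]) -> List[Tuple[int, float, float, bool]]:
    """Summarize how throughput scales with the parallelism degree.

    `throughput` maps a parallelism degree to a measured throughput
    (e.g. events per second). Returns one row per degree:
    (degree, speedup, efficiency, regressed), where
      - speedup is relative to the lowest measured degree,
      - efficiency is speedup divided by the ideal linear speedup,
      - regressed is True when throughput dropped compared with the
        next lower degree (a "paradoxical" effect of more parallelism).
    """
    degrees = sorted(throughput)
    base_degree = degrees[0]
    base_throughput = throughput[base_degree]

    rows = []
    previous = None
    for degree in degrees:
        current = throughput[degree]
        speedup = current / base_throughput
        efficiency = speedup / (degree / base_degree)
        regressed = previous is not None and current < previous
        rows.append((degree, speedup, efficiency, regressed))
        previous = current
    return rows


# Hypothetical measurements (illustrative only, not from the paper).
measurements = {1: 100.0, 2: 180.0, 4: 300.0, 8: 280.0}

for degree, speedup, efficiency, regressed in scaling_report(measurements):
    flag = "  <- throughput dropped" if regressed else ""
    print(f"parallelism={degree:2d} speedup={speedup:.2f} efficiency={efficiency:.2f}{flag}")
```

An efficiency below 1.0 indicates sub-linear scaling, and a flagged row marks a configuration where adding parallelism reduced throughput.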

Problem

Research questions and friction points this paper is trying to address.

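Existing benchmarking systems analyze SPS using sequential operator pipelines in a homogeneous, centralized environment

How SPS use operator parallelism and available resources to process massive workloads is not systematically understood

Existing benchmarking systems do not emphasize the integration of machine learning models for SPS workloads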

Innovation

Methods, ideas, or system contributions that make the work stand out.

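Benchmarks parallel stream processing in a distributed heterogeneous environment

Integrates machine learning models for SPS workloads and compares learned cost models on model and training efficiency

Evaluates Apache Flink using parallel query structures derived from real-world applications and synthetic queries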
