🤖 AI Summary
This work addresses the challenge of meeting deadlines at minimum cost for concurrent streaming queries in cluster environments, under varying input rates and the arrival of new queries. The paper proposes an intermittent (batched) query scheduling framework for elastic parallel execution, extending earlier work on intermittent query processing, which addressed only fixed environments. By scaling cluster nodes up or down, the approach meets windowed query deadlines while reducing resource cost. The schemes are implemented on top of Apache Spark in the AWS EMR environment. Experimental evaluations on TPC-H and Yahoo Streaming Benchmark datasets show that the scheduling algorithms significantly outperform both a fixed set of nodes without elasticity and Spark Streaming.
📝 Abstract
Many applications process a stream of tuples over a window duration and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches), instead of eagerly processing tuples as they arrive, significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples in an elastic parallel environment, where the number of nodes can be scaled up or down. Our scheduling schemes ensure that deadlines are met while minimizing cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark, in the AWS EMR environment, and evaluated performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives, such as using a fixed set of nodes without elasticity, or using Spark Streaming.
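To make the cost argument concrete, the following is a minimal sketch of a toy cost model, not the scheduling schemes proposed in the paper. It assumes perfectly linear scaling across nodes, a cost measured in node-seconds, and a single batch processed after the window closes; the function names and all numeric parameters are hypothetical and chosen only for illustration.

```python
import math


def min_nodes_for_deadline(total_tuples, per_node_rate, available_seconds):
    """Smallest node count that can process total_tuples within
    available_seconds, assuming perfectly linear scaling (a simplification)."""
    if available_seconds <= 0:
        raise ValueError("no time left before the deadline")
    return max(1, math.ceil(total_tuples / (per_node_rate * available_seconds)))


def node_seconds(nodes, busy_seconds):
    """Toy cost metric: number of nodes times the seconds they are held."""
    return nodes * busy_seconds


# Hypothetical workload parameters (not taken from the paper).
window_s = 3600        # window duration in seconds
slack_s = 600          # time allowed after the window ends
input_rate = 1000      # tuples arriving per second
per_node_rate = 5000   # tuples one node can process per second

total = input_rate * window_s

# Eager processing: nodes are held for the whole window, mostly idle.
eager_nodes = min_nodes_for_deadline(total, per_node_rate, window_s)
eager_cost = node_seconds(eager_nodes, window_s)

# Intermittent processing: nodes are acquired only after the window ends
# and released as soon as the batch completes.
batch_nodes = min_nodes_for_deadline(total, per_node_rate, slack_s)
batch_busy_s = total / (batch_nodes * per_node_rate)
batch_cost = node_seconds(batch_nodes, batch_busy_s)

print(f"eager: {eager_nodes} node(s), {eager_cost} node-seconds")
print(f"batch: {batch_nodes} node(s), {batch_cost} node-seconds")
```

Under these assumed numbers, the batched plan uses more nodes for a much shorter time and so consumes fewer node-seconds than holding a node for the full window. A real scheduler would also have to account for node startup time, billing granularity, concurrent queries, and input rate changes, none of which this sketch models.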