Adaptive Scheduling Parallel Jobs with Dynamic Batching in Spark Streaming

Authors: Cheng, Dazhao*; Zhou, Xiaobo; Wang, Yu; Jiang, Changjun
Source: IEEE Transactions on Parallel and Distributed Systems, 2018, 29(12): 2672-2685.
DOI: 10.1109/TPDS.2018.2846234

Abstract

Enterprises today hold massive volumes of stream data that must be processed in real time, a consequence of the data explosion of recent years. Spark Streaming is an emerging system that performs real-time stream analytics using a micro-batch approach: it treats a continuous stream as a series of micro-batches of data and continuously processes the resulting micro-batch jobs. The unified programming model of Spark Streaming offers some unique benefits over traditional streaming systems, such as fast recovery from failures, better load balancing, and better resource usage. However, scheduling micro-batch jobs efficiently to achieve high throughput and low latency is very challenging because of the complex data dependencies and dynamism inherent in streaming workloads. In this paper, we propose A-scheduler, an adaptive scheduling approach that dynamically schedules parallel micro-batch jobs in Spark Streaming and automatically adjusts scheduling parameters to improve performance and resource efficiency. Specifically, A-scheduler schedules multiple jobs concurrently using different policies based on their data dependencies, and automatically adjusts the level of job parallelism and the resource shares among jobs based on workload properties. Furthermore, we integrate a dynamic batching technique with A-scheduler to further improve the overall performance of the customized Spark Streaming system. This technique relies on an expert fuzzy control mechanism to dynamically adjust the length of each batch interval in response to the time-varying streaming workload and system processing rate. We implemented A-scheduler and evaluated it with a real-time security event processing workload. Our experimental results show that, compared to the default Spark Streaming scheduler, A-scheduler with dynamic batching reduces end-to-end latency by 38 percent while improving workload throughput and energy efficiency by 23 and 15 percent, respectively.
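
The idea of adjusting the batch interval in response to the processing rate can be illustrated with a minimal sketch. This is not the paper's fuzzy controller, whose rules the abstract does not give. It is a simple threshold heuristic swapped in for illustration, and the function name, parameters, and default values below are all assumptions.

```python
def next_batch_interval(interval_s, processing_time_s, target_ratio=0.8,
                        step=0.25, min_s=0.5, max_s=10.0):
    """Propose the next batch interval in seconds.

    Hypothetical threshold heuristic (not the paper's fuzzy control):
    - if the last batch took longer than the interval, the system is
      falling behind, so lengthen the interval;
    - if it finished well under the interval, shorten the interval to
      reduce latency;
    - otherwise leave the interval unchanged.
    """
    ratio = processing_time_s / interval_s
    if ratio > 1.0:
        proposed = interval_s * (1 + step)   # falling behind: grow
    elif ratio < target_ratio:
        proposed = interval_s * (1 - step)   # spare capacity: shrink
    else:
        proposed = interval_s                # within the target band
    # Clamp to the assumed allowed range.
    return max(min_s, min(max_s, proposed))
```

With these assumed defaults, a 2-second interval whose batch took 3 seconds would be lengthened, and one whose batch took 1 second would be shortened. A fuzzy controller would replace the hard thresholds with graded membership functions and a rule base.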