A distributed frequent itemset mining algorithm using Spark for Big Data analytics

Authors: Zhang, Feng; Liu, Min; Gui, Feng; Shen, Weiming; Shami, Abdallah; Ma, Yunlong*
Source: Cluster Computing, 2015, 18(4): 1493-1501.
DOI: 10.1007/s10586-015-0477-1

Abstract

Frequent itemset mining is an essential step in association rule mining. In the big data era, conventional approaches to mining frequent itemsets encounter significant challenges when computing power and memory space are limited. This paper proposes an efficient distributed frequent itemset mining algorithm (DFIMA) that significantly reduces the number of candidate itemsets by applying a matrix-based pruning approach. The proposed algorithm has been implemented using Spark to further improve the efficiency of iterative computation. Numerical experiments on standard benchmark datasets, comparing the proposed algorithm with an existing algorithm, parallel FP-growth, show that DFIMA has better efficiency and scalability. In addition, a case study has been carried out to validate the feasibility of DFIMA.
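
For readers unfamiliar with the problem, the following is a minimal single-machine sketch of frequent itemset mining over a Boolean transaction matrix. It is not DFIMA and does not use Spark: the abstract does not specify the matrix-based pruning rule, so this sketch assumes the standard Apriori subset check instead. The function name `frequent_itemsets`, the parameter `min_support`, and the bit-vector encoding of matrix columns are all illustrative assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset that occurs
    in at least `min_support` transactions.

    Illustrative only: generic Apriori over a Boolean matrix, not the
    paper's DFIMA.
    """
    items = sorted({item for t in transactions for item in t})

    # Boolean matrix stored column-wise: one integer bit-vector per item,
    # where bit i is set when transaction i contains the item.
    columns = {
        item: sum(1 << i for i, t in enumerate(transactions) if item in t)
        for item in items
    }

    def support(mask):
        # Number of transactions (set bits) covered by the bit-vector.
        return bin(mask).count("1")

    # Level 1: frequent single items.
    current = {
        frozenset([item]): mask
        for item, mask in columns.items()
        if support(mask) >= min_support
    }

    result = {}
    k = 1
    while current:
        result.update({s: support(m) for s, m in current.items()})
        k += 1
        candidates = {}
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k or union in candidates:
                continue
            # Apriori pruning (assumed here): a k-itemset can only be
            # frequent if all of its (k-1)-subsets are frequent.
            if any(union - {x} not in current for x in union):
                continue
            # AND of the two bit-vectors marks transactions with all items.
            mask = current[a] & current[b]
            if support(mask) >= min_support:
                candidates[union] = mask
        current = candidates
    return result
```

Calling `frequent_itemsets(transactions, min_support)` with a list of item sets returns a dictionary keyed by `frozenset`. The pruning step is what keeps the candidate set small; the abstract attributes DFIMA's reduction in candidates to a matrix-based variant of this idea.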