Frequent Itemset Mining (FIM) is one of the most fundamental techniques in data mining, with extensive applications to problems such as association rule mining, correlation analysis, clustering, and classification. Since frequent itemset mining was first proposed, numerous serial algorithms have been developed to improve mining performance, yet most of them cannot scale to the massive datasets that are now common. In this paper, we propose a new parallel FIM algorithm named PFIN, based on Nodeset, a more efficient data structure for mining frequent itemsets. PFIN decomposes a large-scale FIM problem into a set of tasks, each of which can be executed in parallel without unnecessary communication overhead. Moreover, a hash-based load balancing strategy is adopted to optimize resource use and maximize throughput. To evaluate the performance of PFIN, we have conducted extensive experiments on Spark, an emerging distributed in-memory processing framework, comparing it against PFP, one of the state-of-the-art parallel FIM algorithms, on a range of real datasets. The experimental results demonstrate that PFIN is highly competitive with PFP in scalability and outperforms PFP in speed.
Table of Contents
Abstract
1. Introduction
2. Background
2.1. Frequent Itemset Mining
2.2. Apache Spark Framework
3. Related Work
4. PFIN: The Proposed Method
4.1. FIN Algorithm
4.2. PFIN Outline
4.3. Parallel Counting
4.4. Grouping Items with Load Balancing
4.5. Generating Conditional Transactions
4.6. Constructing Local POC-trees and Generating the Nodesets of 2-itemset
4.7. Constructing Local Set-enumeration Trees and Mining Frequent Itemsets
5. Experiments
5.1. Experiment Setup
5.2. Speed Performance Analysis
5.3. Scalability Performance Analysis
6. Conclusion
Reference
Keywords
data mining, frequent itemset mining, distributed computing, Spark
Authors
Chen Lin [ Department of Computer Science and Technology East China Normal University ]
Junzhong Gu [ Department of Computer Science and Technology East China Normal University ]
Science & Engineering Research Support Center, Republic of Korea (IJDTA)
Founded
2006
Field
Engineering > Computer Science
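About
1. Surveys and research on security engineering
2. Research on and presentation of applied security engineering technologies
3. Hosting academic conferences and exhibitions on security engineering
4. Mutual cooperation and information exchange on security engineering technology
5. Standardization projects and establishment of specifications for security engineering
6. Promotion of industry-academia-research cooperation in security engineering
7. International academic exchange and technical cooperation
8. Publication of journals on security engineering
9. Other activities necessary to achieve the society's objectives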
Publication
Publication title
International Journal of Database Theory and Application
Frequency
Bimonthly
pISSN
2005-4270
Coverage period
2008~2016
Decimal classification
KDC 505, DDC 605
Other papers in this issue / International Journal of Database Theory and Application Vol.9 No.6