计算机科学与探索 ›› 2022, Vol. 16 ›› Issue (1): 120-136.DOI: 10.3778/j.issn.1673-9418.2008068

• 数据库技术 • 上一篇    下一篇

采用N-list结构的混合并行频繁项集挖掘算法

刘卫明, 张弛, 毛伊敏+()   

  1. 江西理工大学 信息工程学院,江西 赣州 341000
  • 收稿日期:2020-08-21 修回日期:2020-11-03 出版日期:2022-01-01 发布日期:2020-11-19
  • 通讯作者: + E-mail: mymlyc@163.com
  • 作者简介:刘卫明(1964-),男,江西新余人,教授,硕士生导师,主要研究方向为数据挖掘、大数据。
    张弛(1994-),男,河南洛阳人,硕士研究生,主要研究方向为数据挖掘、大数据。
    毛伊敏(1970-),女,新疆伊犁人,博士,教 授,硕士生导师,主要研究方向为数据挖掘、大数据。
  • 基金资助:
    国家重点研发计划(2018YFC1504705);国家自然科学基金(41562019);江西省教育厅科技项目(GJJ151528);江西省教育厅科技项目(GJJ151531)

Hybrid Parallel Frequent Itemsets Mining Algorithm by Using N-List Structure

LIU Weiming, ZHANG Chi, MAO Yimin+()   

  1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Received:2020-08-21 Revised:2020-11-03 Online:2022-01-01 Published:2020-11-19
  • About author:LIU Weiming, born in 1964, professor, M.S. supervisor. His research interests include data mining and big data.
    ZHANG Chi, born in 1994, M.S. candidate. His research interests include data mining and big data.
    MAO Yimin, born in 1970, Ph.D., professor, M.S. supervisor. Her research interests include data mining and big data.
  • Supported by:
    National Key Research and Development Program of China(2018YFC1504705);National Natural Science Foundation of China(41562019);Science and Technology Project of Jiangxi Provincial Education Department(GJJ151528);Science and Technology Project of Jiangxi Provincial Education Department(GJJ151531)

摘要:

针对大数据环境下并行MRPrePost频繁项集挖掘算法中存在计算节点负载不均衡,N-list合并效率低以及冗余搜索等问题,提出了基于N-list结构的混合并行频繁项集挖掘算法HP-FIMBN。首先,设计负载量估计函数(LE)来计算出频繁1项集F-list中每一项的负载量,同时提出基于贪心策略的分组方法(GM-GS)将F-list中的每一项根据其负载量进行均匀分组,既解决了数据划分中计算节点负载不均衡的问题,又降低了集群中各节点上子PPC-Tree树的规模;其次,提出预先放弃策略(EAS),该策略不仅能有效避免合并过程中的无效计算,而且不需要遍历初始N-list结构就能得到最终的N-list,极大地提高了N-list结构的合并效率;最后,采用集合枚举树作为搜索空间,并提出超集等价剪枝策略(SES)来避免挖掘过程中的冗余搜索,生成最终的挖掘结果。实验结果表明,该算法在大数据环境下进行频繁项集挖掘具有较好的效果。

关键词: 频繁项集挖掘, N-list结构, 贪心策略, 集合枚举树, 超集等价剪枝策略(SES)

Abstract:

Aiming at the problem of unbalanced load, bad efficiency of N-list merge and redundant search for each node based on parallel frequent itemset mining algorithm MRPrePost (parallel PrePost algorithm based on MapReduce), this paper proposes a hybrid parallel frequent itemset mining algorithm based on N-list (HP-FIMBN). Firstly, a load estimation function (LE) is designed to calculate the load of each item in F-list, and a grouping method based on greedy strategy (GM-GS) is proposed, which not only deals with the problem of unbalanced load in the process of data partitioning, but also decreases the size of sub-PPC-tree on local node. Secondly, in order to accelerate the completion of the merging of two N-list structures, it designs an early abandon strategy (EAS), which can not only efficiently avoid invalid calculations in the merging process, but also does not need to traverse the initial N-list structure to get the final result. Finally, the set-enumeration tree is adopted as the search space, a superset equivalent strategy (SES) is proposed to avoid redundant search in the mining process and the final mining results are generated. Experimental results show that the modified algorithm has better performance on mining frequent itemsets in big data environment.

Key words: frequent item mining, N-list, greedy strategy, set-enumeration tree, superset equivalent strategy (SES)

中图分类号: