Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (5): 1064-1075.DOI: 10.3778/j.issn.1673-9418.2010064

• Database Technology • Previous Articles     Next Articles

Improved Parallel Random Forest Algorithm Combining Information Theory and Norm

MAO Yimin(), GENG Junhao   

  1. School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China
  • Received:2020-10-26 Revised:2021-01-22 Online:2022-05-01 Published:2022-05-19
  • About author:MAO Yimin, born in 1970, Ph.D., professor, M.S. supervisor. Her research interests include data mining, big data, etc.
    GENG Junhao, born in 1997, M.S. candidate. His research interests include data mining, big data, etc.
  • Supported by:
    National Key Research and Development Program of China(2018YFC1504705);National Natural Science Foundation of China(41562019);Science and Technology Foundation of Jiangxi Provincial Education Department(GJJ151528);Science and Technology Foundation of Jiangxi Provincial Education Department(GJJ151531)

结合信息论和范数的并行随机森林算法

毛伊敏(), 耿俊豪   

  1. 江西理工大学 信息工程学院,江西 赣州 341000
  • 通讯作者: + E-mail: mymlyc@163.com
  • 作者简介:毛伊敏(1970—),女,新疆伊犁人,博士,教授,硕士生导师,主要研究方向为数据挖掘、大数据等。
    耿俊豪(1997—),男,河南洛阳人,硕士研究生,主要研究方向为数据挖掘、大数据等。
  • 基金资助:
    国家重点研发计划(2018YFC1504705);国家自然科学基金(41562019);江西省教育厅科技项目(GJJ151528);江西省教育厅科技项目(GJJ151531)

Abstract:

Aiming at the problems of excessive redundancy and irrelevant features, low training feature information and low parallelization efficiency in big data random forest algorithm based on MapReduce, this paper proposes a parallel random forest algorithm based on information theory and norm (PRFITN). Firstly, the algorithm designs the DRIGFN (dimension reduction based on information gain and Frobenius norm) strategy to reduce the number of redundant and irrelevant features. Secondly, a feature grouping strategy based on information theory (FGSIT) is proposed. According to the FGSIT strategy, the features are grouped, and the stratified sampling method is adopted to ensure the information amount of the training features when constructing the decision tree in the random forest. Accuracy of classification results is improved. Finally, in order to improve the parallel efficiency of the cluster, the redistribution of key-value pairs (RSKP) is presented to realize the rapid and uniform distribution of key-value pairs, and obtain the global classification results. Experimental results show that the algorithm has better classification effect in big data environment, especially for datasets with more features.

Key words: MapReduce, random forest (RF), DRIGFN strategy, feature grouping strategy based on information theory (FGSIT), redistribution of key-value pairs (RSKP) strategy

摘要:

针对MapReduce框架下的随机森林算法在处理大数据问题时存在的冗余与不相关特征过多,训练特征信息量低以及并行化效率低等问题,提出了大数据下基于信息论和范数的并行随机森林算法(PRFITN)。首先,该算法基于信息增益和Frobenius范数设计了一种混合降维策略(DRIGFN),获得降维后的数据集,有效减少了冗余及不相关特征数;其次,提出了基于信息论的特征分组策略(FGSIT),根据FGSIT策略将特征分组,采用分层抽样方法,保证了随机森林中决策树构建时训练特征的信息量,提高了分类结果的准确度;最后,在Reduce阶段提出了一种键值对重分配策略(RSKP),获取全局的分类结果,实现了键值对的快速均匀分配,从而提高了集群的并行效率。实验结果表明,该算法在大数据环境下,尤其是针对特征数较多的数据集有更好的分类效果。

关键词: MapReduce框架, 随机森林(RF), DRIGFN策略, 基于信息论的特征分组策略(FGSIT), 键值对重分配策略(RSKP)

CLC Number: