Improved Parallel Random Forest Algorithm Combining Information Theory and Norm

doi:10.3778/j.issn.1673-9418.2010064

Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (5): 1064-1075. DOI: 10.3778/j.issn.1673-9418.2010064

MAO Yimin, GENG Junhao

School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China

Received: 2020-10-26 Revised: 2021-01-22 Online: 2022-05-01 Published: 2022-05-19
Corresponding author e-mail: mymlyc@163.com
About the authors: MAO Yimin, born in 1970 in Yili, Xinjiang, Ph.D., professor, M.S. supervisor. Her research interests include data mining, big data, etc.
GENG Junhao, born in 1997 in Luoyang, Henan, M.S. candidate. His research interests include data mining, big data, etc.
Supported by:
National Key Research and Development Program of China (2018YFC1504705); National Natural Science Foundation of China (41562019); Science and Technology Foundation of Jiangxi Provincial Education Department (GJJ151528); Science and Technology Foundation of Jiangxi Provincial Education Department (GJJ151531)

Abstract:

Random forest algorithms on the MapReduce framework face three problems when processing big data: too many redundant and irrelevant features, low information content in the training features, and low parallel efficiency. To address them, this paper proposes a parallel random forest algorithm based on information theory and norm (PRFITN). First, the algorithm designs a hybrid dimension reduction strategy based on information gain and the Frobenius norm (DRIGFN), which yields a reduced dataset and effectively decreases the number of redundant and irrelevant features. Second, a feature grouping strategy based on information theory (FGSIT) is proposed: features are grouped according to FGSIT and then drawn by stratified sampling, which guarantees the information content of the training features when each decision tree in the random forest is built and improves the accuracy of the classification results. Finally, a key-value pair redistribution strategy (RSKP) is proposed for the Reduce stage to obtain the global classification results; it distributes key-value pairs quickly and evenly and thereby improves the parallel efficiency of the cluster. Experimental results show that the algorithm achieves better classification performance in big data environments, especially on datasets with a large number of features.

Key words: MapReduce, random forest (RF), DRIGFN strategy, feature grouping strategy based on information theory (FGSIT), redistribution of key-value pairs (RSKP) strategy
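
The two quantities that DRIGFN is named after, information gain and the Frobenius norm, can be illustrated with a minimal sketch. This page does not say how the paper combines them, so the code below only shows the standard definitions; the function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the conditional entropy given a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a matrix (list of rows)."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


# Hypothetical toy data: one feature that separates the classes, one that does not.
labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]
irrelevant = ["a", "b", "a", "b"]

gain_informative = information_gain(informative, labels)
gain_irrelevant = information_gain(irrelevant, labels)
norm_example = frobenius_norm([[3, 4], [0, 0]])
```

A feature with low information gain tells little about the class label, which is the usual motivation for ranking features by this score before discarding some of them.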

CLC number:

TP311

MAO Yimin, GENG Junhao. Improved Parallel Random Forest Algorithm Combining Information Theory and Norm[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(5): 1064-1075.

Figures/Tables 7

Fig.1 RSKP strategy
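
The abstract describes RSKP as distributing key-value pairs quickly and evenly in the Reduce stage, but this page gives no further detail. As a rough illustration of what an even distribution means, the sketch below uses a standard greedy heuristic (largest group first, always to the least-loaded reducer). It is a stand-in, not the paper's RSKP; the key names and sizes are hypothetical.

```python
import heapq


def assign_keys_to_reducers(key_sizes, num_reducers):
    """Greedy balancing: take key groups from largest to smallest and give
    each one to the reducer that currently has the smallest total load."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending size; ties are broken by key name for determinism.
    for key, size in sorted(key_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + size, reducer))
    loads = [0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += key_sizes[key]
    return assignment, loads


# Hypothetical number of key-value pairs emitted under each key.
key_sizes = {"tree_0": 7, "tree_1": 5, "tree_2": 4, "tree_3": 3, "tree_4": 1}
assignment, loads = assign_keys_to_reducers(key_sizes, 2)
```

Compared with hashing keys to reducers, which can leave one reducer with most of the pairs, a size-aware assignment of this kind keeps the per-reducer loads close to each other.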

Fig.2 Construct random forests in parallel
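
When the trees are built, the abstract states that features are first grouped by FGSIT and then drawn by stratified sampling, so that each tree receives informative features. The grouping criterion itself is not given on this page. The sketch below therefore assumes the groups already exist and only shows proportional stratified sampling across them; the group names and feature names are hypothetical.

```python
import random


def stratified_feature_sample(groups, k, rng):
    """Draw roughly k features, taking from each group in proportion to its
    size and at least one feature from every non-empty group."""
    total = sum(len(features) for features in groups.values())
    chosen = []
    for name in sorted(groups):
        features = groups[name]
        if not features:
            continue
        quota = max(1, round(k * len(features) / total))
        chosen.extend(rng.sample(features, min(quota, len(features))))
    return chosen


# Hypothetical feature groups, e.g. by how informative the features are.
groups = {
    "high": ["f1", "f2", "f3", "f4"],
    "medium": ["f5", "f6", "f7", "f8"],
    "low": ["f9", "f10"],
}
subset = stratified_feature_sample(groups, 5, random.Random(0))
```

Unlike drawing features uniformly at random, this guarantees that every group contributes to each tree's feature subset, so no tree is trained only on weak features.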

Table 1 Configuration of nodes in experiment

Hostname	IP address	Role
Master	192.168.1.109	Master/JobTracker/NameNode
Slaver_1	192.168.1.110	Slaver/TaskTracker/DataNode
Slaver_2	192.168.1.111	Slaver/TaskTracker/DataNode
Slaver_3	192.168.1.112	Slaver/TaskTracker/DataNode

Table 2 Experimental datasets

Dataset	Number of samples	Number of attributes	Size/MB
Farm Ads	1 692 082	5 267 656	1 481.9
Susy	990 002	41 270	32.1
APS Failure at Scania Trucks	5 000 000	190	321.0

Fig.3 Performance analysis of PRFITN algorithm

Fig.4 Running time of five algorithms on different datasets

Fig.5 Accuracy of four algorithms on different datasets

