Parallelization of Classification Algorithms Based on SparkR

doi:10.3778/j.issn.1673-9418.1503036

Journal of Frontiers of Computer Science and Technology, 2015, Vol. 9, Issue (11): 1281-1294.

LIU Zhiqiang1,2+, GU Rong1,2, YUAN Chunfeng1,2,3, HUANG Yihua1,2,3

1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, China
2. Department of Computer Science and Technology, Nanjing University, Nanjing 210046, China
3. Collaborative Innovation Center for Novel Software Technology and Industry of Jiangsu Province, Nanjing 210046, China

Online: 2015-11-01 Published: 2015-11-03

Abstract

Abstract: In recent years, parallel algorithms for big data machine learning and data mining have become an important research topic in the field of big data. Spark provides a programming interface called SparkR, which lets data analysts in general application areas use the familiar R language to carry out data analysis and computation on the Spark platform. This paper presents the design and implementation, based on SparkR, of several widely used parallel machine learning classification algorithms: multinomial naive Bayes, support vector machine (SVM), and logistic regression. For SVM and logistic regression, an iterative computation mode with parallelized local optimization is adopted on top of conventional parallelization strategies to further speed up training. Experimental results show that the SparkR-based parallel classification algorithms run about 8 times faster than the Hadoop MapReduce implementations, without losing scalability.

Key words: SparkR, classification algorithm, parallelization, local iteration, in-memory computation
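The abstract does not spell out how the parallelized local optimization works, so the following is only a minimal sketch of one common reading of the idea: each data partition runs several gradient-descent passes on its own from the current global weights, and the partition results are then averaged. It is written in plain Python rather than SparkR, the partitions are simulated with lists, and all names (`local_train`, `train`, `make_data`) and parameter values are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_train(weights, partition, local_iters, lr):
    """Run a few gradient-descent passes on one partition, starting from the global weights."""
    w = list(weights)
    for _ in range(local_iters):
        grad = [0.0] * len(w)
        for x, y in partition:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(partition) for wi, g in zip(w, grad)]
    return w


def train(partitions, dim, rounds=20, local_iters=5, lr=0.5):
    """Global loop: share the weights, optimize locally per partition, average the results."""
    w = [0.0] * dim
    for _ in range(rounds):
        # On a cluster this step would run in parallel, one task per partition.
        local_ws = [local_train(w, p, local_iters, lr) for p in partitions]
        w = [sum(ws[j] for ws in local_ws) / len(local_ws) for j in range(dim)]
    return w


def predict(w, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


def make_data(n, seed):
    # Synthetic, linearly separable toy data: the label is 1 when x1 + x2 > 0.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append(([1.0, x1, x2], 1 if x1 + x2 > 0 else 0))
    return rows


data = make_data(400, seed=0)
partitions = [data[i::4] for i in range(4)]  # four simulated partitions
weights = train(partitions, dim=3)
accuracy = sum(predict(weights, x) == y for x, y in data) / len(data)
```

The point of the design is that several local passes happen between synchronizations, so the number of global rounds, and with it the communication cost, can be lower than when every gradient step is aggregated across all partitions.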

LIU Zhiqiang, GU Rong, YUAN Chunfeng, HUANG Yihua. Parallelization of Classification Algorithms Based on SparkR[J]. Journal of Frontiers of Computer Science and Technology, 2015, 9(11): 1281-1294.

