Journal of Frontiers of Computer Science and Technology ›› 2015, Vol. 9 ›› Issue (11): 1281-1294. DOI: 10.3778/j.issn.1673-9418.1503036

Parallelization of Classification Algorithms Based on SparkR

LIU Zhiqiang1,2+, GU Rong1,2, YUAN Chunfeng1,2,3, HUANG Yihua1,2,3   

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, China
  2. Department of Computer Science and Technology, Nanjing University, Nanjing 210046, China
  3. Collaborative Innovation Center for Novel Software Technology and Industry of Jiangsu Province, Nanjing 210046, China
  • Online: 2015-11-01  Published: 2015-11-03

基于SparkR的分类算法并行化研究

刘志强1,2+,顾  荣1,2,袁春风1,2,3,黄宜华1,2,3   

  1. 南京大学 计算机软件新技术国家重点实验室,南京 210046
  2. 南京大学 计算机科学与技术系,南京 210046
  3. 江苏省软件新技术与产业化协同创新中心,南京 210046

Abstract: In recent years, the parallelization of machine learning and data mining algorithms for big data has become an important research topic. Spark provides a programming interface called SparkR that lets data analysts in general application areas use the familiar R language to carry out data analysis and computation on the Spark platform. This paper presents the design and implementation, based on SparkR, of several widely used parallel classification algorithms: Multinomial Naive Bayes, SVM (support vector machine), and Logistic Regression. For SVM and Logistic Regression, it further improves training speed over conventional parallel strategies by adopting an iterative computation scheme based on parallel local optimization. Experimental results show that the SparkR-based classification algorithms run about 8 times faster than their Hadoop MapReduce counterparts without losing scalability.

Key words: SparkR, classification algorithm, parallelization, local iteration, in-memory computation

摘要: 近几年来,大数据机器学习和数据挖掘并行化算法研究成为大数据领域一个较为重要的研究热点。Spark提供了一个称为SparkR的编程接口,方便一般应用领域的数据分析人员使用所熟悉的R语言在Spark平台上完成数据分析和计算。基于SparkR设计并实现了多种常用的并行化的机器学习分类算法,包括多项式贝叶斯分类算法、支持向量机(support vector machine,SVM)算法和Logistic Regression算法。对于SVM和Logistic Regression算法,在常规的并行化策略的基础上为了进一步提升训练速度,设计采用了并行化局部优化的迭代计算模式。实验结果表明,所设计实现的基于SparkR的并行化分类算法与Hadoop MapReduce的方案相比,速度上提升了8倍左右。

关键词: SparkR, 分类算法, 并行化, 局部迭代, 内存计算
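
The abstract names a "local iteration" scheme for Logistic Regression but gives no details, so the following is only a minimal single-machine sketch of one common reading of that idea: each data partition runs a few gradient-descent steps on its own data, and the resulting models are then averaged. It is written in plain Python rather than SparkR, and every function name, the model-averaging step, the learning rate, and the synthetic data are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_train(weights, partition, local_iters, lr):
    """Run a few gradient-descent steps using one partition only (hypothetical helper)."""
    w = list(weights)
    for _ in range(local_iters):
        grad = [0.0] * len(w)
        for x, y in partition:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj - lr * gj / len(partition) for wj, gj in zip(w, grad)]
    return w


def train(partitions, dim, rounds=20, local_iters=5, lr=0.5):
    """Each round: local iterations per partition, then average the local models."""
    w = [0.0] * dim
    for _ in range(rounds):
        # On a cluster this step would run in parallel, one task per partition;
        # here it is a sequential loop standing in for that map.
        local_models = [local_train(w, p, local_iters, lr) for p in partitions]
        # Assumed combination rule: plain averaging of the local models.
        w = [sum(m[j] for m in local_models) / len(local_models) for j in range(dim)]
    return w


def predict(w, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


def make_data(n, seed):
    # Synthetic, linearly separable points: label 1 when x1 + x2 > 0.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(([1.0, x1, x2], 1 if x1 + x2 > 0 else 0))
    return data


if __name__ == "__main__":
    data = make_data(400, seed=0)
    partitions = [data[i::4] for i in range(4)]  # four simulated partitions
    w = train(partitions, dim=3)
    accuracy = sum(predict(w, x) == y for x, y in data) / len(data)
    print("weights:", w)
    print("training accuracy:", accuracy)
```

Running several local steps between averaging rounds cuts the number of synchronization points compared with averaging gradients after every single step, which is the kind of saving a local-iteration strategy aims for; whether the paper combines local models by averaging or by some other rule is not stated in the abstract.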