计算机科学与探索 ›› 2014, Vol. 8 ›› Issue (10): 1162-1176.DOI: 10.3778/j.issn.1673-9418.1312042

• 学术研究 • 上一篇    下一篇

基于OpenCL的Kmeans算法的优化研究

吴再龙1,2+,张云泉2,徐建良1,贾海鹏2,颜深根3,解庆春3   

  1. 1. 中国海洋大学 信息科学与工程学院,山东 青岛 266100
    2. 中国科学院 计算技术研究所 计算机体系结构国家重点实验室,北京 100190
    3. 中国科学院 软件研究所 并行软件与计算科学实验室,北京 100190
  • 出版日期:2014-10-01 发布日期:2014-09-29

Research on Kmeans Algorithm Optimization Based on OpenCL

WU Zailong1,2+, ZHANG Yunquan2, XU Jianliang1, JIA Haipeng2, YAN Shengen3, XIE Qingchun3   

  1. 1. School of Information Science and Technology, Ocean University of China, Qingdao, Shandong 266100, China
    2. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
    3. Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  • Online:2014-10-01 Published:2014-09-29

摘要: Kmeans算法是无监督机器学习中一种典型的聚类算法,是对已知数据集进行划分和分组的重要方法,在图像处理、数据挖掘、生物学领域有着广泛的应用。随着实际应用中数据规模的不断变大,对Kmeans算法的性能也提出了更高的要求。在充分考虑不同硬件平台体系架构差异的基础上,系统地研究了Kmeans算法在GPU和APU平台上实现与优化的关键技术:片上全局同步高效实现,冗余计算减少全局同步次数,线程任务重映射,局部内存重用等,实现了Kmeans算法在不同硬件平台上的高性能与性能移植。实验结果表明,优化后的算法在考虑数据传输时间的前提下,在AMD HD7970 GPU上相对于CPU版本取得136.975~170.333倍的加速比,在AMD A10-5800K APU上相对于CPU版本取得22.236 5~24.386 5倍的加速比,有效验证了优化方法的有效性和平台的可移植性。

关键词: OpenCL, 并行计算, Kmeans, 迭代算法, 跨平台

Abstract: As a typical clustering algorithm and an important method to data decomposition and packet processing, Kmeans algorithm is widely used in image processing, machine learning and biology, etc. Due to the constant expansion on data set, Kmeans is facing more and more demand on its performance. Having taken into full account the difference between hardware platforms and architectures, this paper conducts a systematic research on achieving Kmeans algorithm efficiently running on GPU and APU platforms based on OpenCL. And with the help of several optimization methods, such as the implementation of iterative algorithm with multiple global synchronization in GPU, the reduction on global synchronization by redundant computation, the redistribution on thread task, the reuse of local memory, etc, Kmeans algorithm achieves high efficient implementation on different hardware architectures and the optimization methods suitable for iterative algorithm are summed up. The experimental results show that the optimized algorithm gets 136.975~170.333 times speedup on AMD HD7970 GPU than the CPU version (with considering the data transfer time) and gets 22.2365~24.3865 times speedup on AMD A10-5800K APU than the CPU version, which effectively verifies the validity and the cross-platform ability of the optimization methods proposed in this paper.

Key words: OpenCL, parallel computing, Kmeans, iterative algorithm, cross-platform