Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (12): 1409-1421. DOI: 10.3778/j.issn.1673-9418.1406004

Parallel Subject Indexing Algorithm in YARN Platform

LI Ruixuan+, LIAO Dongjie, GU Xiwu, WEN Kunmei, ZHAO Shuoyi, DONG Xinhua   

  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
  • Online: 2014-12-01 Published: 2014-12-08

YARN平台上的并行主题标引算法

李瑞轩+,廖东杰,辜希武,文坤梅,赵铄乂,董新华   

  1. 华中科技大学 计算机科学与技术学院,武汉 430074

Abstract: Subject indexing is an important component of personalized intelligent search systems, but at large data volumes it becomes a performance bottleneck. Subject indexing algorithms built on the MapReduce framework are widely used, yet they suffer from long task start-up times and excessive disk I/O for intermediate data. This paper adopts YARN (yet another resource negotiator) as the underlying resource management platform and chooses more appropriate computing frameworks to improve performance. Because the subject indexing algorithm is multistage, the directed acyclic graph (DAG) computation model is used to implement it, which avoids unnecessary job splitting and thereby reduces the disk I/O of intermediate results. In addition, because the sorting strategy of MapReduce is time-consuming and some computations do not need sorted results, this paper adopts a hash-based data gathering strategy to improve computing performance. The hash-based strategy, however, introduces random reads. To address this, the paper designs an optimization strategy that exploits the high-speed random read capability of solid state disks (SSDs) to further improve computational efficiency. Experimental results show that choosing a suitable computing framework on top of YARN and optimizing it can effectively improve distributed computing performance.

Key words: subject indexing, YARN platform, directed acyclic graph (DAG) computation, solid state disk
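
The contrast the abstract draws between sort-based and hash-based data gathering can be illustrated with a minimal, single-process sketch. This is not the paper's implementation: the function names `gather_by_sort` and `gather_by_hash` and the sample term counts are hypothetical, and the sketch does not model disk spills, the DAG framework, or SSD random reads. It only shows that when the consumer does not need keys in order, a one-pass hash table yields the same per-key totals without the sort.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def gather_by_sort(pairs):
    """Sort-based gathering: order all pairs by key, then merge adjacent runs."""
    ordered = sorted(pairs, key=itemgetter(0))  # O(n log n) comparison sort
    return {
        key: sum(value for _, value in run)
        for key, run in groupby(ordered, key=itemgetter(0))
    }


def gather_by_hash(pairs):
    """Hash-based gathering: a single pass that never orders the keys."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value  # expected O(1) per pair
    return dict(totals)


# Hypothetical (term, count) pairs, as a map stage might emit them.
pairs = [("yarn", 1), ("dag", 1), ("yarn", 1), ("ssd", 1), ("dag", 1), ("yarn", 1)]

sorted_totals = gather_by_sort(pairs)
hashed_totals = gather_by_hash(pairs)
```

Both functions return the same totals for the same input; the hash-based version trades the ordering guarantee for a cheaper pass, and that trade is where the abstract locates the resulting random-read problem.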

摘要: 文档主题标引是当前个性化智能检索的重要前提,但面对大规模海量数据资源时,主题标引也成为性能瓶颈。当前在MapReduce框架上设计实现的主题标引算法,通常存在启动任务耗时长,中间数据过多地进行磁盘IO等缺陷。为了解决此类问题,采用YARN(yet another resource negotiator)作为底层分布式资源管理平台,选择更加合适的计算框架来改善计算性能。针对文档主题标引算法计算步骤多、阶段性强的特点,选择有向无环图(directed acyclic graph, DAG)计算模型进行算法实现,避免不必要的作业拆分,从而减少中间结果的磁盘IO。另外,考虑到MapReduce的排序策略耗时较多,而有些计算无需对结果排序,故可以改用基于Hash的数据归约策略来提高计算性能,但这又会带来随机读的问题。利用固态硬盘高速随机读的特性,设计相应的优化计算策略来解决随机读的问题。通过实验对比发现,以YARN为底层管理平台,在此基础上选择合适的计算框架并加以优化,可以有效改善分布式计算的性能。

关键词: 主题标引, YARN平台, 有向无环图计算框架, 固态硬盘