Journal of Frontiers of Computer Science and Technology ›› 2017, Vol. 11 ›› Issue (5): 732-741. DOI: 10.3778/j.issn.1673-9418.1608041


Feature Extension and Category Research for Short Text Based on Spark Platform

WANG Wen1,2, ZHAO Kankan1,2, LI Cuiping1,2+, CHEN Hong1,2, SUN Hui1,2   

  1. Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing 100872, China
  2. School of Information, Renmin University of China, Beijing 100872, China
  • Online: 2017-05-01 Published: 2017-05-04


王  雯1,2,赵衎衎1,2,李翠平1,2+,陈  红1,2,孙  辉1,2   

  1. 中国人民大学 数据工程与知识工程教育部重点实验室,北京 100872
  2. 中国人民大学 信息学院,北京 100872

Abstract: Short text classification often suffers from high feature dimensionality, sparse features, and poor classification accuracy. Feature extension addresses these problems effectively, but it greatly reduces execution efficiency. To improve both the accuracy and the efficiency of short text classification, this paper proposes an association-rule-based feature extension and classification method designed for the Spark platform. Given a background corpus, the method first extends the original short texts by mining association rules and their confidences to supplement the features. For classification, it then applies a new cascade SVM (support vector machine) algorithm based on distance selection. Finally, the feature extension and classification algorithms are designed for the Spark platform, where the distributed design improves the efficiency of short text processing. Experiments show that, on large data sets, the method on a Spark cluster is about 4 times as efficient as the traditional single-machine approach and raises classification accuracy by about 15% on average over traditional classification, of which feature extension and classification optimization contribute about 10% and 5%, respectively.

Key words:  short text classification, feature extension, association rule, Spark platform
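
The feature extension step summarized in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not specify the mining algorithm, the thresholds, or how confidences are used as weights, so the pairwise rules, the `min_support` and `min_confidence` values, the function names `mine_rules` and `extend_features`, and the toy corpus below are all assumptions made for illustration. The Spark distribution and the cascade SVM are omitted.

```python
from collections import Counter
from itertools import combinations


def mine_rules(background, min_support=2, min_confidence=0.6):
    """Mine one-to-one rules a -> b from a background corpus.

    confidence(a -> b) = count(docs with a and b) / count(docs with a).
    Thresholds are illustrative assumptions, not values from the paper.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for doc in background:
        terms = sorted(set(doc))  # sort so each pair has one canonical order
        item_counts.update(terms)
        pair_counts.update(combinations(terms, 2))

    rules = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_support:
            continue
        # Check the rule in both directions; confidence is asymmetric.
        for lhs, rhs in ((a, b), (b, a)):
            conf = n_ab / item_counts[lhs]
            if conf >= min_confidence:
                rules.setdefault(lhs, {})[rhs] = conf
    return rules


def extend_features(doc, rules):
    """Original terms get weight 1.0; added terms get the rule confidence."""
    weights = {term: 1.0 for term in doc}
    for term in doc:
        for rhs, conf in rules.get(term, {}).items():
            if rhs not in doc:
                # Keep the strongest rule if several suggest the same term.
                weights[rhs] = max(weights.get(rhs, 0.0), conf)
    return weights


# Hypothetical background corpus of tokenized short texts.
background = [
    ["spark", "cluster", "rdd"],
    ["spark", "rdd", "shuffle"],
    ["spark", "cluster"],
    ["svm", "kernel", "margin"],
    ["svm", "margin"],
]
rules = mine_rules(background)
extended = extend_features(["rdd"], rules)
print(extended)
```

In this sketch a one-term text `["rdd"]` gains the term `spark`, because every background document containing `rdd` also contains `spark`. The extended weights could then be fed to any classifier in place of the sparse original vector.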

摘要: 短文本分类经常面临特征维度高、特征稀疏、分类准确率差的问题。特征扩展是解决上述问题的有效方法,但却面临更大的短文本分类效率瓶颈。结合以上问题和现状,针对如何提升短文本分类准确率及效率进行了详细研究,提出了一种Spark平台上的基于关联规则挖掘的短文本特征扩展及分类方法。该方法首先采用背景语料库,通过关联规则挖掘的方式对原短文本进行特征补充;其次针对分类过程,提出基于距离选择的层叠支持向量机(support vector machine,SVM)算法;最后设计Spark平台上的短文本特征扩展与分类算法,通过分布式算法设计,提高短文本处理的效率。实验结果显示,采用提出的Spark平台上基于关联规则挖掘的短文本特征扩展方法后,针对大数据集,Spark集群上短文本特征扩展及分类效率约为传统单机上效率的4倍,且相比于传统分类实验,平均得到约15%的准确率提升,其中特征扩展及分类优化准确率提升分别为10%与5%。

关键词: 短文本分类, 特征扩展, 关联规则, Spark平台