Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (6): 950-960. DOI: 10.3778/j.issn.1673-9418.1705045

• Artificial Intelligence and Pattern Recognition •

Incremental Algorithm for Clustering Short Texts on News Comments

LIU Xiaolin1,2, CAO Fuyuan1,2, LIANG Jiye1,2+

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
  • Online: 2018-06-01 Published: 2018-06-06

Abstract: Incremental clustering of news comments can effectively reveal netizens' views on news events, which is of great significance for public opinion analysis. Because traditional incremental clustering algorithms are sensitive to the order in which texts arrive, this paper proposes UCSP (uncertain cyclic Single-Pass), an incremental clustering algorithm based on an uncertain (pending) cyclic strategy. To address the lack of semantic information and the feature sparsity of the traditional vector space model for short texts, the paper combines word vectors trained with a neural network to construct a short-text representation model based on a combination of multiple features. Experiments on five crawled Tencent news comment data sets, with comparisons against traditional text representation models and clustering algorithms, show that the proposed algorithm effectively improves clustering quality.

Key words: public opinion analysis, short texts, incremental clustering algorithm, vector space model, neural network
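
The abstract names the order sensitivity of Single-Pass clustering but does not spell out how UCSP works. The following is a minimal Python sketch of that problem and of one possible deferral scheme; it is not the authors' algorithm. The function names, the two thresholds `low` and `high`, the rule of re-examining pending items until a round resolves nothing, and the final fallback of joining the nearest cluster are all assumptions made for illustration. Texts are assumed to be already encoded as numeric vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vec, sums):
    """Index and similarity of the closest cluster, or (None, 0.0) if none exist."""
    best, best_sim = None, 0.0
    for k, s in enumerate(sums):
        sim = cosine(vec, s)
        if best is None or sim > best_sim:
            best, best_sim = k, sim
    return best, best_sim


def _join(sums, k, vec):
    # Each cluster is stored as the sum of its members; cosine similarity is
    # scale-invariant, so comparing against the sum equals comparing against the mean.
    sums[k] = [a + b for a, b in zip(sums[k], vec)]


def single_pass(vectors, threshold):
    """Classic Single-Pass: join the nearest cluster if similar enough, else open a new one."""
    sums, labels = [], []
    for vec in vectors:
        k, sim = nearest(vec, sums)
        if k is not None and sim >= threshold:
            _join(sums, k, vec)
        else:
            sums.append(list(vec))
            k = len(sums) - 1
        labels.append(k)
    return labels


def deferred_single_pass(vectors, low, high):
    """Hypothetical deferral variant (an assumption, not the published UCSP)."""
    sums, labels = [], [None] * len(vectors)
    pending = list(range(len(vectors)))
    while pending:
        still_pending = []
        for i in pending:
            k, sim = nearest(vectors[i], sums)
            if k is None or sim < low:       # clearly new: open a cluster
                sums.append(list(vectors[i]))
                labels[i] = len(sums) - 1
            elif sim >= high:                # clearly a member: join
                _join(sums, k, vectors[i])
                labels[i] = k
            else:                            # uncertain: look again next round
                still_pending.append(i)
        if len(still_pending) == len(pending):
            break                            # a full round resolved nothing
        pending = still_pending
    for i in pending:                        # fallback: join the nearest cluster
        k, _ = nearest(vectors[i], sums)
        _join(sums, k, vectors[i])
        labels[i] = k
    return labels


def as_partition(vectors, labels):
    """Order-independent view of a clustering: a set of frozensets of points."""
    groups = {}
    for vec, k in zip(vectors, labels):
        groups.setdefault(k, set()).add(tuple(vec))
    return {frozenset(g) for g in groups.values()}


# Toy input: m lies between a and b, somewhat closer to a.
a, m, b = (1.0, 0.0), (1.0, 0.8), (0.0, 1.0)
order1, order2 = [a, m, b], [b, m, a]
classic_agrees = (as_partition(order1, single_pass(order1, 0.6))
                  == as_partition(order2, single_pass(order2, 0.6)))
deferred_agrees = (as_partition(order1, deferred_single_pass(order1, 0.3, 0.9))
                   == as_partition(order2, deferred_single_pass(order2, 0.3, 0.9)))
```

On this toy input the classic version groups `m` with whichever of `a` or `b` arrives first, so the two orders give different partitions. The deferral sketch waits until both clusters exist before placing `m`, and gives the same partition for both orders. This illustrates the general idea of postponing uncertain decisions and says nothing about the results reported in the paper.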