Journal of Frontiers of Computer Science and Technology, 2019, Vol. 13, Issue (7): 1102-1113. DOI: 10.3778/j.issn.1673-9418.1809009


Research on Improved BBTM Model for Microblog Hot Topic Discovery

HUANG Chang1,2,3, GUO Wenzhong1,2,3, GUO Kun1,2,3+   

  1. College of Mathematics and Computer Sciences, Fuzhou University, Fuzhou 350116, China
  2. Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116, China
  3. Key Laboratory of Ministry of Education for Spatial Data Mining & Information Sharing, Fuzhou University, Fuzhou 350116, China
  • Online: 2019-07-01   Published: 2019-07-08

Abstract: Hot topic discovery methods built on topic models for short microblog texts suffer from sparse, high-dimensional features and require the number of topics to be specified in advance. To address these problems, this paper proposes the hot topic-hot biterm topic model (H-HBTM), a hot topic discovery method based on an improved bursty biterm topic model (BBTM). First, word burst probability is used for feature selection, filtering out non-bursty words. Second, the hot burst probability of each microblog word pair (biterm) is computed by combining the burst and propagation characteristics of microblog texts, and it serves as the prior probability in BBTM. Finally, a density-based method adaptively selects the optimal number of topics, and the resulting BBTM is used to detect hot topics. Experiments on real microblog datasets show that H-HBTM finds the optimal topic model automatically, without a pre-specified number of topics, and that the hot topics it discovers are of higher quality than those found by BBTM, the biterm topic model, and latent Dirichlet allocation.

Key words: hot topic detection, microblog, bursty biterm topic model (BBTM), topic model
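
The first and last steps of the pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the burst estimate (the share of a word's current count that exceeds its average count in earlier time slices), the `threshold` cutoff, and the use of average pairwise cosine similarity between topics as a simple stand-in for the density-based selection. All function names are hypothetical. The second step, the hot burst prior built from propagation features, is not sketched because its inputs are not specified in the abstract.

```python
from collections import Counter
import math


def burst_probability(count_now, past_counts, eps=0.01):
    """Share of the current count above the recent average (assumed form)."""
    if count_now <= 0:
        return 0.0
    baseline = sum(past_counts) / len(past_counts) if past_counts else 0.0
    # eps keeps the estimate strictly positive for non-bursty words.
    return max(count_now - baseline, eps) / count_now


def filter_bursty_words(current_docs, past_slices, threshold=0.5):
    """Keep words whose burst estimate reaches `threshold` (hypothetical cutoff)."""
    now = Counter(w for doc in current_docs for w in doc)
    past = [Counter(w for doc in docs for w in doc) for docs in past_slices]
    return {
        w for w, n in now.items()
        if burst_probability(n, [c[w] for c in past]) >= threshold
    }


def mean_topic_similarity(topics):
    """Average pairwise cosine similarity between topic-word vectors."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    pairs = [(i, j) for i in range(len(topics)) for j in range(i + 1, len(topics))]
    if not pairs:
        return 0.0
    return sum(cosine(topics[i], topics[j]) for i, j in pairs) / len(pairs)


def pick_topic_count(models_by_k):
    """Choose the K whose topics overlap least (simplified selection rule)."""
    return min(models_by_k, key=lambda k: mean_topic_similarity(models_by_k[k]))
```

Here `current_docs` is a list of tokenized posts from the current time slice, `past_slices` is a list of such lists for earlier slices, and `models_by_k` maps each candidate topic count to the topic-word vectors of a model trained with that count. Preferring the candidate with the least similar topics is only one plausible reading of "density-based"; the paper's actual criterion may differ.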
