Journal of Frontiers of Computer Science and Technology ›› 2017, Vol. 11 ›› Issue (4): 565-576. DOI: 10.3778/j.issn.1673-9418.1603050

• Network and Information Security •

Semantic Community Detection Method for Large-Scale Information Networks

SHEN Guilan1,2+, JIA Caiyan3, YU Jian3, YANG Xiaoping2

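  1. Business School, Beijing Union University, Beijing 100025
    2. School of Information, Renmin University of China, Beijing 100087
    3. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
  • Published: 2017-04-12 Online: 2017-04-12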

Semantic Community Detection Algorithm for Large Scale Information Network

SHEN Guilan1,2+, JIA Caiyan3, YU Jian3, YANG Xiaoping2   

  1. Business School, Beijing Union University, Beijing 100025, China
    2. School of Information, Renmin University of China, Beijing 100087, China
    3. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Online: 2017-04-12 Published: 2017-04-12

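Abstract: Semantic community detection in information networks whose nodes carry content is a new research direction. Incorporating node content increases the complexity of the algorithm. This paper proposes a label propagation algorithm that performs semantic community detection in linear time. Node content is represented with the LDA (latent Dirichlet allocation) topic model, and a multiplicative model of node content similarity and propagation influence serves as the label propagation strategy, so that node content and network structure information are fused naturally during normalization. During label iteration, a node is updated only when its content differs from that of the vast majority of its neighbors, which keeps the algorithm efficient. Experiments on 12 real-world datasets of different sizes, with modularity and purity as evaluation metrics, verify the effectiveness and feasibility of the algorithm for semantic community detection.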

Keywords: semantic community detection, LDA topic model, content similarity, label propagation strategy, propagation influence

Abstract: An information network is a complex network whose nodes carry semantic information. Semantic community detection in information networks is a new research direction, but taking node content into account increases the complexity of community detection algorithms. This paper therefore proposes a label propagation algorithm that runs in linear time and is suitable for large-scale information networks. First, the latent Dirichlet allocation (LDA) topic model is used to represent node content. Second, a multiplicative model of content similarity and propagation influence is taken as the label propagation strategy, so that content and network topology are combined naturally during normalization. Third, the algorithm updates a node's label only when the node differs from the vast majority of its neighbors. Extensive experiments on 12 real-world datasets of varying sizes and characteristics show that the proposed method outperforms the baseline algorithms in quality.

Key words: semantic community detection, latent Dirichlet allocation topic model, content similarity, label propagation strategy, influence propagation
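
The propagation strategy summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: topic vectors are supplied directly rather than fitted with LDA, content similarity is taken to be cosine similarity, propagation influence is taken to be the neighbor's degree, and the "vast majority" rule is modeled with a hypothetical `agree_threshold` parameter applied to neighbor labels. The function name `semantic_label_propagation` is likewise illustrative.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_label_propagation(adj, topics, agree_threshold=0.8, max_iter=20):
    """Content-aware label propagation (illustrative sketch only).

    adj:    dict node -> list of neighbor nodes (undirected graph)
    topics: dict node -> topic vector, e.g. produced by an LDA model
    agree_threshold: hypothetical cutoff; a node is skipped when at least
        this fraction of its neighbors already carry its label.
    """
    labels = {v: v for v in adj}  # every node starts in its own community
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):  # fixed order keeps the sketch deterministic
            nbrs = adj[v]
            if not nbrs:
                continue
            same = sum(1 for u in nbrs if labels[u] == labels[v])
            if same / len(nbrs) >= agree_threshold:
                continue  # node already agrees with most of its neighbors
            # Multiplicative weight: content similarity x influence.
            # Influence is ASSUMED to be the neighbor's degree here.
            weights = {u: cosine(topics[v], topics[u]) * len(adj[u]) for u in nbrs}
            total = sum(weights.values())
            if total == 0:
                continue
            score = defaultdict(float)
            for u, w in weights.items():
                score[labels[u]] += w / total  # normalize over the neighbors
            best = max(sorted(score), key=score.get)
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels


# Two triangles joined by the edge 2-3; each triangle has its own dominant topic.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
topics = {v: [0.9, 0.1] if v < 3 else [0.1, 0.9] for v in adj}
result = semantic_label_propagation(adj, topics)
print(result)
```

On this toy graph the low content similarity across the bridge edge keeps the two triangles in separate communities. Skipping nodes that already agree with most neighbors reduces the work done in later iterations, which is the kind of saving the update rule in the abstract aims at.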