S-SimRank: A Document Similarity Computation Method Combining Content and Link Information

doi:10.3778/j.issn.1673-9418.2009.04.005

Journal of Frontiers of Computer Science and Technology, 2009, Vol. 3, Issue 4: 378-391. DOI: 10.3778/j.issn.1673-9418.2009.04.005

S-SimRank: Combining Content and Link Information to Cluster Papers Effectively and Efficiently

CAI Yuanzhe^1,2, LI Pei^1,2, LIU Hongyan^3, HE Jun^1,2+, DU Xiaoyong^1,2

1. Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education, Renmin University of China, Beijing 100872, China
2. School of Information, Renmin University of China, Beijing 100872, China
3. Department of Management Science and Engineering, Tsinghua University, Beijing 100084, China

Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-07-15 Published: 2009-07-15
Contact: HE Jun

Abstract

Abstract: Content analysis and link analysis are two common approaches to computing document similarity, for example in recommender systems. Compared with content analysis, link analysis can uncover more implicit relationships between documents, but noise among the documents makes it difficult for link analysis alone to produce precise results. To address this problem, a new algorithm, S-SimRank (Star-SimRank), is proposed that combines the content information and link information of documents to improve the accuracy of similarity computation. Experimental results on the ACM data set show that S-SimRank substantially outperforms other algorithms in both accuracy and efficiency. Finally, a mathematical proof of the convergence of S-SimRank is given.

Key words: linkage mining, similarity calculation, text mining
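
The two signals contrasted in the abstract can be illustrated with a minimal Python sketch. This is not the S-SimRank update rule, which the abstract does not state. It shows only a bag-of-words cosine similarity as one possible content measure and plain SimRank as the link measure. The function names, the decay value of 0.8, the iteration count, and the toy citation graph are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Content-based similarity: cosine of whitespace-tokenised term counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def simrank(in_links, decay=0.8, iterations=10):
    """Link-based similarity: plain SimRank (not S-SimRank).

    in_links maps each node to the list of nodes linking to it; every
    node mentioned in a list must also appear as a key.
    """
    nodes = list(in_links)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif not in_links[u] or not in_links[v]:
                    # A node with no in-links is similar only to itself.
                    new[(u, v)] = 0.0
                else:
                    total = sum(sim[(x, y)]
                                for x in in_links[u] for y in in_links[v])
                    new[(u, v)] = decay * total / (
                        len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim


# Hypothetical toy graph: papers p1 and p2 are both cited by p3.
citations = {"p1": ["p3"], "p2": ["p3"], "p3": []}
link_sim = simrank(citations)
content_sim = cosine_similarity("link analysis of papers",
                                "content analysis of papers")
```

In this toy graph, p1 and p2 share no words by construction of the link measure yet still receive a non-zero link similarity because they share a citing paper, which is the kind of implicit relationship the abstract attributes to link analysis.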

CAI Yuanzhe, LI Pei, LIU Hongyan, HE Jun, DU Xiaoyong. S-SimRank: Combining Content and Link Information to Cluster Papers Effectively and Efficiently[J]. Journal of Frontiers of Computer Science and Technology, 2009, 3(4): 378-391.
