Co-Training: A Content and Link Based Web Spam Detection Method*

Journal of Frontiers of Computer Science and Technology, 2010, Vol. 4, Issue 10: 899-908. DOI: 10.3778/j.issn.1673-9418.2010.10.004

Content and Link Based Web Spam Detection with Co-Training*

WEI Xiaojuan^1,2+, LI Cuiping^1,2, CHEN Hong^1,2

1. Key Lab of Data Engineering and Knowledge Engineering of MOE, Renmin University of China, Beijing 100872, China
2. School of Information, Renmin University of China, Beijing 100872, China

Received: 1900-01-01; Revised: 1900-01-01; Online: 2010-10-01; Published: 2010-10-01
Contact: WEI Xiaojuan

Abstract

Abstract: Web spam refers to pages that deceive search engines, either by manipulating page content or by building tightly knit link communities around irrelevant pages, in order to obtain an undeservedly high rank. It degrades the accuracy and relevance of search results. This paper proposes a Web spam detection method based on the Co-Training model. It uses two mutually independent groups of page features, content-based statistical features and link-based features derived from the Web graph, to build two separate base classifiers. It then applies the semi-supervised Co-Training algorithm, which leverages a large amount of unlabeled data together with a few labeled examples, to improve classifier quality. Experimental results on the WEBSPAM-UK2007 dataset show that the algorithm improves the performance of the SVM classifier.

Key words: Web spam detection method, content-based spam, link-based spam, Co-Training

CLC number: TP311

Cite this article: WEI Xiaojuan, LI Cuiping, CHEN Hong. Content and Link Based Web Spam Detection with Co-Training*[J]. Journal of Frontiers of Computer Science and Technology, 2010, 4(10): 899-908.
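
The Co-Training scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base learner here is a simple nearest-centroid classifier standing in for the SVMs used in the paper, the one-dimensional feature values are invented toy data, and the names `CentroidClassifier`, `fit_views`, `co_train`, `rounds`, and `per_round` are hypothetical. The selection rule (each view labels the unlabeled hosts it is most confident about) follows the generic Co-Training idea and is an assumption about the details.

```python
class CentroidClassifier:
    """Stand-in base learner: nearest class centroid (the paper uses SVMs)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_with_confidence(self, x):
        # Distance from x to every class centroid, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, label)
            for label, c in self.centroids.items()
        )
        # Confidence: gap between the nearest and second-nearest centroid.
        margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
        return dists[0][1], margin


def fit_views(clf_c, clf_l, content, link, labels):
    """Train one classifier per feature view on the currently labeled hosts."""
    idx = [i for i, t in enumerate(labels) if t is not None]
    y = [labels[i] for i in idx]
    clf_c.fit([content[i] for i in idx], y)
    clf_l.fit([link[i] for i in idx], y)


def co_train(content, link, labels, rounds=5, per_round=2):
    """labels[i] is 1 (spam), 0 (normal), or None (unlabeled)."""
    labels = list(labels)
    clf_c, clf_l = CentroidClassifier(), CentroidClassifier()
    for _ in range(rounds):
        fit_views(clf_c, clf_l, content, link, labels)
        if all(t is not None for t in labels):
            break
        # Each view pseudo-labels the unlabeled hosts it is most confident
        # about; the new labels are shared with the other view next round.
        for clf, view in ((clf_c, content), (clf_l, link)):
            candidates = [i for i, t in enumerate(labels) if t is None]
            scored = [(clf.predict_with_confidence(view[i]), i) for i in candidates]
            scored.sort(key=lambda s: s[0][1], reverse=True)
            for (pred, _margin), i in scored[:per_round]:
                labels[i] = pred
    fit_views(clf_c, clf_l, content, link, labels)
    return clf_c, clf_l, labels


# Toy data (invented): one content feature and one link feature per host.
content = [[0.9], [0.1], [0.8], [0.2], [0.85], [0.15], [0.7], [0.3]]
link = [[0.8], [0.2], [0.9], [0.1], [0.75], [0.25], [0.95], [0.05]]
initial = [1, 0, None, None, None, None, None, None]  # two labeled hosts

clf_c, clf_l, final_labels = co_train(content, link, initial)
print(final_labels)
```

Swapping `CentroidClassifier` for an SVM that exposes a decision margin would bring the sketch closer to the setup the abstract describes, while leaving the Co-Training loop unchanged.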

