The determination of the earliest news reporting time on the Web

doi:10.3778/j.issn.1673-9418.2009.01.005

Journal of Frontiers of Computer Science and Technology ›› 2009, Vol. 3 ›› Issue (1): 51-59. DOI: 10.3778/j.issn.1673-9418.2009.01.005

HUANG Lian’en+, ZHANG Yan, LI Xiaoming

School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-01-20 Published: 2009-01-20
Contact: HUANG Lian’en

Abstract

Abstract: Accurately determining the earliest time at which a news report was published on the Web is of considerable value for computer-aided social science research. Statistics show that for 40% of news reports, the publication time cannot be extracted directly from the Web page. In those cases, estimates that rely only on the crawl time and the last-modified time (LMT) reported in the HTTP header can be far from the real publication time. Two methods for improving accuracy are proposed: link analysis and replica analysis. Large-scale experiments show that both methods have a very low probability of error and are practical. Link analysis reduces the estimation error to some extent, while replica analysis plays the decisive role: when multiple copies (reprints) of a report can be found on the Web, its earliest publication time on the Web can be inferred accurately with high probability.

Key words: publication time, information mining, content analysis, replica detection
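
The role of replicas described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not spell out; it only assumes that each copy of a report gives an upper bound on when that copy appeared (its crawl time, or its LMT when the LMT is not later than the crawl time), so the smallest bound over all copies is an upper bound on the earliest publication time. The names `Copy`, `latest_possible_time` and `earliest_report_bound` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Copy:
    """One crawled copy (reprint) of a report. Hypothetical record layout."""
    url: str
    crawl_time: datetime
    last_modified: Optional[datetime] = None  # LMT from the HTTP header, if any


def latest_possible_time(copy: Copy) -> datetime:
    """Upper bound on when this copy was published.

    Assumption: a page seen at crawl_time was published no later than that,
    and a plausible LMT (not after the crawl) tightens the bound.
    """
    lmt = copy.last_modified
    if lmt is not None and lmt <= copy.crawl_time:
        return lmt
    return copy.crawl_time


def earliest_report_bound(copies: Iterable[Copy]) -> datetime:
    """Smallest upper bound over all copies; raises ValueError if there are none."""
    return min(latest_possible_time(c) for c in copies)


copies = [
    Copy("http://example.org/a", datetime(2008, 5, 3, 12, 0)),
    Copy("http://example.org/b", datetime(2008, 5, 2, 9, 0),
         datetime(2008, 5, 1, 8, 30)),
    # LMT later than the crawl time is treated as unreliable and ignored.
    Copy("http://example.org/c", datetime(2008, 5, 4), datetime(2008, 5, 6)),
]
print(earliest_report_bound(copies))
```

With more copies the minimum can only move earlier, which is one way to read the abstract's claim that finding multiple replicas makes an accurate estimate likely.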

HUANG Lian’en+, ZHANG Yan, LI Xiaoming. The determination of the earliest news reporting time on the Web[J]. Journal of Frontiers of Computer Science and Technology, 2009, 3(1): 51-59.
