Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2009, Vol. 3, Issue (1): 51-59. DOI: 10.3778/j.issn.1673-9418.2009.01.005

• Academic Research •

The determination of the earliest news reporting time on the Web

HUANG Lian’en+, ZHANG Yan, LI Xiaoming

  1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-01-20 Published: 2009-01-20
  • Contact: HUANG Lian’en

Abstract: Accurately determining the earliest time at which a news report was published on the Web is of significant value for computer-aided social science research. The data show that for 40% of news reports, the publication time cannot be extracted directly from the Web page. In those cases, estimating the publication time solely from the crawl time and the last-modified time (LMT) supplied in the HTTP header leads to large errors. Two methods that improve accuracy are proposed: link analysis and replica analysis. Large-scale experiments show that both methods have a very low probability of error and are practical. Link analysis reduces the estimation error to some extent, while replica analysis plays the decisive role: when multiple copies (reposts) of a report can be found on the Web, its earliest publication time on the Web can be inferred accurately with high probability.

Key words: publication time detection, Web information mining, Web page content analysis, replica detection
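
The abstract does not spell out the replica analysis procedure, so the following is only a minimal sketch of the underlying idea, not the authors' algorithm. It assumes that every time signal attached to a copy (a publication time extracted from the page, the HTTP last-modified time, the crawl time) is trustworthy and is an upper bound on when that copy appeared, and that the earliest such bound across all copies bounds the report's first appearance on the Web. The names `PageCopy`, `upper_bound`, and `earliest_publication_bound` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass
class PageCopy:
    """One copy (repost) of the same report; all field names are hypothetical."""
    url: str
    crawl_time: datetime                       # when the crawler fetched the page
    last_modified: Optional[datetime] = None   # HTTP Last-Modified value, if sent
    extracted_time: Optional[datetime] = None  # time parsed from the page, if any


def upper_bound(copy: PageCopy) -> datetime:
    """Earliest available time signal for one copy.

    Assumption: each signal is honest, so the copy already existed by then.
    """
    signals = [copy.crawl_time, copy.last_modified, copy.extracted_time]
    return min(t for t in signals if t is not None)


def earliest_publication_bound(copies: Sequence[PageCopy]) -> datetime:
    """Upper bound on the report's earliest publication time across all copies."""
    if not copies:
        raise ValueError("at least one copy is required")
    return min(upper_bound(c) for c in copies)
```

Under these assumptions, each additional copy can only tighten the bound, which is consistent with the abstract's observation that reports with many copies are the ones whose earliest publication time can be inferred most reliably.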