Journal of Frontiers of Computer Science and Technology ›› 2017, Vol. 11 ›› Issue (1): 114-123. DOI: 10.3778/j.issn.1673-9418.1510016


Information Extraction Research Aimed at Open Source Web Pages

LIU Chunmei1,2+, GUO Yan1, YU Xiaoming1, ZHAO Ling1, LIU Yue1, CHENG Xueqi1   

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100190, China
  • Online: 2017-01-01 Published: 2017-01-10

针对开源论坛网页的信息抽取研究

刘春梅1,2+,郭  岩1,俞晓明1,赵  岭1,刘  悦1,程学旗1   

  1. 中国科学院 计算技术研究所,北京 100190
  2. 中国科学院大学,北京 100190

Abstract: A large proportion of forum Web pages on the Internet are generated by open source software. For such forums, this paper proposes a template-based information extraction method. First, a cluster partitioning strategy based on the similarity of Web page structure is presented, and experiments show that it outperforms partitioning directly by intuitive categories such as software version number. Second, a clustering algorithm based on open source software features is proposed; it automatically partitions large-scale forum Web pages generated by open source software according to page similarity, forming categories that can be annotated. Experiments show that the method retains the high accuracy of template-based extraction while sharply reducing the manual cost of configuring and maintaining templates.

Key words: record locating, Web page clustering, template extraction

摘要: 互联网上大量论坛使用开源软件生成,针对这类论坛,提出了针对论坛网页信息抽取的基于模板的信息抽取方法。首先给出了基于网页结构相似度的簇划分策略,并通过实验证明了该策略优于直接基于软件版本号等直观类别的划分策略;其次提出了基于开源软件特征的聚类算法,能够根据网页相似度将大规模开源软件生成的论坛网页进行有效的自动划分,形成可标注类别。实验表明,该方法不仅保持了基于模板的抽取方法所具有的高准确率的优点,同时弥补了其模板配置与维护代价高的缺点。

关键词: 记录定位, 网页聚类, 模板抽取