Information Extraction Research Aimed at Open Source Web Pages

doi:10.3778/j.issn.1673-9418.1510016

Journal of Frontiers of Computer Science and Technology, 2017, Vol. 11, Issue (1): 114-123. DOI: 10.3778/j.issn.1673-9418.1510016

LIU Chunmei1,2+, GUO Yan1, YU Xiaoming1, ZHAO Ling1, LIU Yue1, CHENG Xueqi1

1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100190, China

Online: 2017-01-01 Published: 2017-01-10

Abstract

Abstract: A large share of Internet forums are generated by open source software. For such forums, this paper proposes a template-based method for extracting information from forum Web pages. First, a cluster partitioning strategy based on the structural similarity of Web pages is presented, and experiments show that it outperforms partitioning directly by intuitive categories such as software version number. Second, a clustering algorithm based on open source software features is proposed; it automatically partitions large-scale forum Web pages generated by open source software according to page similarity, forming categories that can be annotated. Experiments show that the method retains the high accuracy of template-based extraction while reducing the high cost of template configuration and maintenance.

Key words: record locating, Web page clustering, template extraction
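
As a rough illustration of what grouping pages by structural similarity can look like, the sketch below represents each page by its set of root-to-node tag paths, compares pages with Jaccard similarity, and assigns each page greedily to the first sufficiently similar cluster. This is not the algorithm of the paper: the tag-path representation, the Jaccard measure, the greedy assignment, the 0.6 threshold, and all function names are assumptions chosen only to make the idea concrete.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(a, b):
    """Jaccard similarity between two sets of tag paths."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Greedy one-pass clustering against each cluster's first page (assumed scheme)."""
    clusters = []  # list of (representative tag paths, member page indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if structure_similarity(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]
```

In a template-based pipeline, each resulting cluster would be a candidate for a single hand-annotated extraction template, which is where the saving in configuration effort would come from.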

LIU Chunmei, GUO Yan, YU Xiaoming, ZHAO Ling, LIU Yue, CHENG Xueqi. Information Extraction Research Aimed at Open Source Web Pages[J]. Journal of Frontiers of Computer Science and Technology, 2017, 11(1): 114-123.
