Research on Query Construction Method for Deep Web Data Crawling

doi:10.3778/j.issn.1673-9418.1409021

Journal of Frontiers of Computer Science and Technology ›› 2015, Vol. 9 ›› Issue (9): 1025-1033. DOI: 10.3778/j.issn.1673-9418.1409021

Research on Query Construction Method for Deep Web Data Crawling

LIN Hailun1+, YANG Xiaogang2, XIONG Jinhua1, WANG Yuanzhuo1, JIA Yantao1, CHENG Xueqi1

1. Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2. Laboratory of the Technology Bureau, Xinhua News Agency, Beijing 100803, China

Online: 2015-09-01 Published: 2015-12-11

Abstract

The large scale, multi-source heterogeneity, dynamic updating, and high noise of network big data pose great challenges to knowledge acquisition. In particular, many websites keep their data in Web databases behind HTML forms. Such Deep Web data can only be accessed dynamically by submitting form queries, so Web crawlers that follow hyperlinks between pages can hardly collect them, which reduces the coverage of the acquired knowledge resources. How to crawl and exploit these data efficiently is therefore a challenging problem. This paper first analyzes the existing query construction methods for Deep Web data crawling in detail and introduces the methods designed for different types of forms. It then summarizes the advantages and limitations of the existing surfacing-based query construction methods. Finally, it discusses future work on query construction for Deep Web data crawling, with the aim of promoting the further development of Deep Web crawling techniques.

Key words: Deep Web, query interface, query construction, Web crawler

LIN Hailun, YANG Xiaogang, XIONG Jinhua, WANG Yuanzhuo, JIA Yantao, CHENG Xueqi. Research on Query Construction Method for Deep Web Data Crawling[J]. Journal of Frontiers of Computer Science and Technology, 2015, 9(9): 1025-1033.

