Journal of Frontiers of Computer Science and Technology, 2015, Vol. 9, Issue (9): 1025-1033. DOI: 10.3778/j.issn.1673-9418.1409021

• Academic Research •

Research on Query Construction Method for Deep Web Data Crawling

LIN Hailun1+, YANG Xiaogang2, XIONG Jinhua1, WANG Yuanzhuo1, JIA Yantao1, CHENG Xueqi1

  1. Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  2. Laboratory of the Technology Bureau, Xinhua News Agency, Beijing 100803, China
  • Online: 2015-09-01 Published: 2015-12-11

Abstract: The large scale, multi-source heterogeneity, dynamic updating, and high noise of network big data pose great challenges for knowledge acquisition. In particular, many websites keep their data in Web databases behind HTML forms. This Deep Web data can only be accessed dynamically by submitting form queries. Because Web crawlers collect resources by following hyperlinks between pages, they can hardly reach these data, which reduces the coverage of the knowledge resources obtained. Crawling these data efficiently and making use of them is therefore very challenging. This paper first presents a detailed analysis of existing query construction methods for Deep Web data crawling and introduces the methods that apply to different types of forms. It then summarizes the advantages and limitations of the existing surfacing-based query construction methods. Finally, it discusses future work on query construction for Deep Web data crawling, with the aim of promoting the further development of Deep Web crawling techniques.

Key words: Deep Web, query interface, query construction, Web crawler