Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (9): 1049-1066.DOI: 10.3778/j.issn.1673-9418.1310017


Research on Automated Web Navigation and Data Integration Rules for Web Information Extraction

WANG Haitao1,2, ZHANG Zhiliang3, SUN Yuhua3, YUAN Chunfeng1,2, HUANG Yihua1,2+

  1. Department of Computer Science and Technology, Nanjing University, Nanjing 210046, China
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, China
  3. Guangzhou Power Supply Co. Ltd., Guangzhou 510620, China
  • Online: 2014-09-01 Published: 2014-09-03

Web信息抽取网页自动浏览导航与集成规则研究

王海涛1,2,张志亮3,孙煜华3,袁春风1,2,黄宜华1,2+   

  1. 南京大学 计算机科学与技术系,南京 210046
  2. 南京大学 计算机软件新技术国家重点实验室,南京 210046
  3. 广州供电局 信息中心,广州 510620

Abstract: The Web contains a large amount of valuable data, and many Web information extraction techniques have been studied over the past decade. However, most existing studies and systems focus on extracting data from Web pages that have already been acquired, and ignore or simplify the automated navigation and data integration that a complete extraction process requires. To address this problem, this paper proposes a three-stage Web information extraction model comprising automated navigation, data extraction and data integration. Based on this model, it designs a navigation model together with an automated navigation rule language. It further proposes an ETI (extraction-transformation-integration) model and an integration and workflow control rule language, which can effectively maintain the complex relationships among cross-page data records and provide flexible workflow control. Results on extraction examples show that the proposed rule language and the implemented system can effectively carry out Web page navigation and data extraction.

Key words: Web information extraction, automated Web navigation, data integration, workflow control, rule language

摘要: Web中蕴藏着大量有价值的数据,过去十几年中,针对Web信息抽取技术已有较多的研究。而现有的研究和系统多集中在数据抽取处理阶段,忽略或简化了完整的Web信息抽取过程需要的网页自动浏览导航和集成处理。为克服这些不足,提出了包含浏览导航、数据抽取和集成过程的三阶段Web信息抽取处理模型,基于此进一步研究提出了自动浏览导航模型,并设计实现了网页自动浏览导航规则语言。研究提出了一种Web数据抽取、转换和集成(extraction-transformation-integration,ETI)模型,设计实现了一套灵活有效的数据集成和流程控制规则语言,能有效地维护跨网页数据记录的复杂关系,并提供灵活的流程控制能力。抽取实例的结果表明,该规则语言和系统可有效完成全过程化的Web信息抽取集成处理功能。

关键词: Web信息抽取, 自动浏览导航, 数据集成, 流程控制, 规则语言
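
The three-stage model named in the abstract (automated navigation, data extraction, data integration) can be illustrated with a minimal Python sketch. This is not the paper's rule language, whose syntax the abstract does not show. The in-memory `SITE` dictionary, the function names `navigate`, `extract`, `integrate` and `run`, and the page layout are all hypothetical, chosen only to show how a record can keep its link to the page it was reached from.

```python
import re
from typing import Dict, Iterator, List, Tuple

# Hypothetical in-memory "site" (URL -> HTML); it stands in for real HTTP fetches.
SITE: Dict[str, str] = {
    "/list": '<a href="/item/1">A</a> <a href="/item/2">B</a>',
    "/item/1": "<h1>Widget</h1><span class='price'>9.50</span>",
    "/item/2": "<h1>Gadget</h1><span class='price'>12.00</span>",
}


def navigate(start: str) -> Iterator[Tuple[str, str, str]]:
    """Stage 1 (navigation): follow every link found on the start page."""
    for url, label in re.findall(r'<a href="([^"]+)">(.*?)</a>', SITE[start]):
        # Yield the link text too, so the list-page data travels with the detail page.
        yield url, label, SITE[url]


def extract(html: str) -> Dict[str, str]:
    """Stage 2 (extraction): pull raw string fields out of one acquired page."""
    name = re.search(r"<h1>(.*?)</h1>", html).group(1)
    price = re.search(r"class='price'>(.*?)<", html).group(1)
    return {"name": name, "price": price}


def integrate(items) -> List[Dict[str, object]]:
    """Stage 3 (integration): convert types and merge fields from both pages."""
    return [
        {"source": url, "label": label, "name": rec["name"], "price": float(rec["price"])}
        for url, label, rec in items
    ]


def run(start: str = "/list") -> List[Dict[str, object]]:
    """Chain the three stages into one workflow."""
    return integrate((url, label, extract(html)) for url, label, html in navigate(start))
```

Calling `run()` returns one merged record per detail page. Each record combines the link label taken from the list page with the name and price taken from the detail page, which is a small instance of the cross-page relationship that the integration stage is meant to preserve.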