Research on Automated Web Navigation and Data Integration Rules for Web Information Extraction

doi:10.3778/j.issn.1673-9418.1310017

Journal of Frontiers of Computer Science and Technology, 2014, Vol. 8, Issue 9: 1049-1066.

WANG Haitao1,2, ZHANG Zhiliang3, SUN Yuhua3, YUAN Chunfeng1,2, HUANG Yihua1,2+

1. Department of Computer Science and Technology, Nanjing University, Nanjing 210046, China
2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, China
3. Information Center, Guangzhou Power Supply Co. Ltd., Guangzhou 510620, China

Online: 2014-09-01 Published: 2014-09-03

Abstract

Abstract: The Web contains a large amount of valuable data, and many Web information extraction techniques have been studied over the past decade. However, most existing studies and systems focus on the data extraction stage, applied to Web pages that have already been acquired, and ignore or simplify the automated navigation and data integration that a complete Web information extraction process requires. To overcome these shortcomings, this paper proposes a three-stage Web information extraction model comprising automated navigation, data extraction and data integration. Based on this model, it further proposes an automated navigation model and designs and implements an automated Web page navigation rule language. It also proposes an ETI (extraction-transformation-integration) model for Web data and designs and implements a flexible rule language for data integration and workflow control, which can effectively maintain complex relationships among data records across pages and provides flexible workflow control. Results on extraction examples show that the proposed rule language and the implemented system can effectively carry out the whole process of Web information extraction and integration.

Key words: Web information extraction, automated Web navigation, data integration, workflow control, rule language

WANG Haitao, ZHANG Zhiliang, SUN Yuhua, YUAN Chunfeng, HUANG Yihua. Research on Automated Web Navigation and Data Integration Rules for Web Information Extraction[J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(9): 1049-1066.

