Information Extraction of University Research Faculty Based on LCA Segmentation Algorithm

Journal of Frontiers of Computer Science and Technology, 2016, Vol. 10, Issue 6: 761-772. DOI: 10.3778/j.issn.1673-9418.1508055

YI Chenhui1+, LIU Mengchi1, HU Jie2

1. School of Computer, Wuhan University, Wuhan 430072, China
2. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China

Online: 2016-06-01 Published: 2016-06-07

Abstract

Abstract: Existing information extraction methods for semi-structured Web pages generally assume that valid data items are structurally similar: they split a page into data records and data regions with similar features and then extract from them. Web pages that list university research faculty, however, are mostly written and filled in by hand rather than generated from templates, so their structure is not rigorous. To handle such weakly structured pages, this paper proposes a faculty information extraction method based on a lowest common ancestor (LCA) segmentation algorithm. The method brings the connection between the LCA and the strength of semantic relatedness into Web page segmentation and introduces the concepts of basic semantic blocks and effective semantic blocks. After the page is converted into a document object model (DOM) tree and preprocessed, it is first divided into basic semantic blocks by searching upward for LCA nodes. The basic semantic blocks are then merged, using the characteristics of faculty information, into effective semantic blocks that hold complete faculty information. Finally, the effective semantic blocks are aligned to obtain all relationally mapped faculty information on the current page. Experiments on a large number of real university faculty pages show that, compared with the MDR (mining data records) algorithm, the proposed method maintains high precision and recall in both segmentation and extraction.

Key words: information extraction, lowest common ancestor (LCA), basic semantic block, effective semantic block, relational mapping
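
The abstract does not give the paper's exact splitting and merging rules, so the following is only a minimal Python sketch of the underlying idea: text nodes whose lowest common ancestor lies deep in the DOM tree tend to be closely related, while nodes that meet only near the root belong to different records. The sample markup, the function names, and the cut rule (start a new block wherever adjacent leaves have the shallowest LCA) are all assumptions made for illustration and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written faculty listing (not from the paper's data set).
PAGE = """
<div>
  <div><p><b>Alice Wang</b></p><p>Professor</p><span>alice@example.edu</span></div>
  <div><p><b>Bob Li</b></p><p>Lecturer</p><span>bob@example.edu</span></div>
</div>
"""

def build_parent_map(root):
    """Map every element to its parent; the root maps to None."""
    parents = {root: None}
    for parent in root.iter():
        for child in parent:
            parents[child] = parent
    return parents

def path_to_root(node, parents):
    """Return [node, parent, grandparent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path

def lowest_common_ancestor(a, b, parents):
    """Walk upward from b until a node on a's path to the root is reached."""
    on_path_of_a = set(path_to_root(a, parents))
    for node in path_to_root(b, parents):
        if node in on_path_of_a:
            return node
    return None

def depth(node, parents):
    """Depth of a node, with the root at depth 0."""
    return len(path_to_root(node, parents)) - 1

def segment_by_lca(root):
    """Toy rule (an assumption): cut between adjacent text leaves whose
    LCA is the shallowest one observed on the page."""
    parents = build_parent_map(root)
    leaves = [e for e in root.iter() if len(e) == 0 and (e.text or "").strip()]
    if not leaves:
        return []
    lca_depths = [
        depth(lowest_common_ancestor(a, b, parents), parents)
        for a, b in zip(leaves, leaves[1:])
    ]
    cut_depth = min(lca_depths, default=0)
    blocks = [[leaves[0].text.strip()]]
    for leaf, d in zip(leaves[1:], lca_depths):
        if d == cut_depth:
            blocks.append([])  # shallow LCA: weakly related, start a new block
        blocks[-1].append(leaf.text.strip())
    return blocks

blocks = segment_by_lca(ET.fromstring(PAGE.strip()))
for block in blocks:
    print(block)
```

On this toy page the name, title, and e-mail of each person meet at their own inner `div`, whereas the last field of one person and the first field of the next meet only at the outer `div`, so the sketch yields one block per person. Real faculty pages are far less regular, which is the case the paper's merging of basic semantic blocks into effective semantic blocks is meant to address.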

YI Chenhui, LIU Mengchi, HU Jie. Information Extraction of University Research Faculty Based on LCA Segmentation Algorithm[J]. Journal of Frontiers of Computer Science and Technology, 2016, 10(6): 761-772.


