Journal of Frontiers of Computer Science and Technology ›› 2016, Vol. 10 ›› Issue (6): 761-772. DOI: 10.3778/j.issn.1673-9418.1508055

Information Extraction of University Research Faculty Based on LCA Segmentation Algorithm

YI Chenhui1+, LIU Mengchi1, HU Jie2   

  1. School of Computer, Wuhan University, Wuhan 430072, China
    2. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
  • Online: 2016-06-01 Published: 2016-06-07

基于LCA分块算法的大学科研人员信息抽取

易晨辉1+,刘梦赤1,胡  婕2   

  1. 武汉大学 计算机学院,武汉 430072
    2. 湖北大学 计算机与信息工程学院,武汉 430062

Abstract: Conventional methods for extracting information from semi-structured pages usually assume that valid data share strong structural similarity: they divide the page into data records and data regions with similar characteristics and then extract from them. However, university faculty list pages are mostly written and filled in by hand rather than generated automatically from templates, so their structure is not rigorous. To handle such weakly structured pages, this paper proposes a faculty information extraction method based on an LCA (lowest common ancestor) segmentation algorithm, introduces the connection between LCA and the strength of semantic relatedness into Web page segmentation, and presents the new concepts of basic semantic blocks and effective semantic blocks. After the page is converted into a DOM (document object model) tree and preprocessed, it is first divided into basic semantic blocks by searching upward for LCA nodes. The basic semantic blocks are then merged, using the characteristics of personnel information, into effective semantic blocks that each contain complete personnel information. Finally, all faculty information mapped by the relationships in the current page is obtained by aligning the effective semantic blocks. Experimental results show that, compared with the MDR (mining data records) algorithm, the proposed method maintains high precision and recall in segmenting and extracting from a large number of real university research faculty list pages.
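The idea of using LCA nodes to split a DOM tree into blocks can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details the abstract does not give. It assumes that a deeper LCA between two adjacent text nodes indicates a stronger semantic relation, and it uses a hypothetical splitting rule (start a new block wherever the LCA depth drops below the maximum observed depth). The sample page, the function names, and the threshold rule are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real faculty pages are messier).
PAGE = """
<html><body>
  <div id="list">
    <div class="person"><span>Alice</span><span>Professor</span></div>
    <div class="person"><span>Bob</span><span>Lecturer</span></div>
  </div>
</body></html>
"""

def build_parent_map(root):
    """Map every element to its parent (ElementTree stores no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}

def path_to_root(node, parents):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path

def lowest_common_ancestor(a, b, parents):
    """First node on b's upward path that is also an ancestor of (or equal to) a."""
    ancestors_of_a = set(path_to_root(a, parents))
    for node in path_to_root(b, parents):
        if node in ancestors_of_a:
            return node
    return None

def depth(node, parents):
    """Depth of a node, with the root at depth 0."""
    return len(path_to_root(node, parents)) - 1

def segment_by_lca(leaves, parents):
    """Group adjacent text leaves; assumed rule: split where LCA depth < max depth."""
    if len(leaves) < 2:
        return [list(leaves)]
    depths = [
        depth(lowest_common_ancestor(a, b, parents), parents)
        for a, b in zip(leaves, leaves[1:])
    ]
    threshold = max(depths)
    blocks = [[leaves[0]]]
    for leaf, d in zip(leaves[1:], depths):
        if d >= threshold:
            blocks[-1].append(leaf)   # closely related: same block
        else:
            blocks.append([leaf])     # shallower LCA: start a new block
    return blocks

root = ET.fromstring(PAGE)
parents = build_parent_map(root)
# Text leaves: elements without children that carry non-blank text.
leaves = [e for e in root.iter() if len(e) == 0 and (e.text or "").strip()]
blocks = [[e.text for e in block] for block in segment_by_lca(leaves, parents)]
print(blocks)  # [['Alice', 'Professor'], ['Bob', 'Lecturer']]
```

In this toy page each resulting block happens to hold one person's complete record. On hand-written pages a block found this way may hold only a fragment of a record, which is why the method described above adds a separate step that merges basic semantic blocks into effective semantic blocks.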

Key words: information extraction, lowest common ancestor (LCA), basic semantic block, effective semantic block, relational mapping

摘要: 现有的半结构化网页信息抽取方法主要假设有效数据间具有较强结构相似性,将网页分割为具有类似特征的数据记录与数据区域然后进行抽取。但是存有大学科研人员信息的网页大多是人工编写填入内容,结构特征并不严谨。针对这类网页的弱结构性,提出了一种基于最近公共祖先(lowest common ancestor,LCA)分块算法的人员信息抽取方法,将LCA和语义相关度强弱的联系引入网页分块中,并提出了基本语义块与有效语义块的概念。在将网页转换成文档对象模型(document object model,DOM)树并进行预处理后,首先通过向上寻找LCA节点的方法将页面划分为基本语义块,接着结合人员信息的特征将基本语义块合并为存有完整人员信息的有效语义块,最后根据有效语义块的对齐获取当前页面所有关系映射的人员信息。实验结果表明,该方法在大量真实的大学人员网页的分块与抽取中,与MDR(mining data records)算法相比仍能保持较高的准确率与召回率。

关键词: 信息抽取, 最近公共祖先(LCA), 基本语义块, 有效语义块, 关系映射