计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2014, Vol. 8, Issue (7): 802-811. DOI: 10.3778/j.issn.1673-9418.1312024

• Database Technology •

Crowdsourcing Entity Resolution with Privacy Protection (支持隐私保护的众包实体解析)

YAN Cairong (燕彩蓉)1+, ZHANG Yangshun (张洋舜)1, XU Guangwei (徐光伟)1,2

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  2. Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, China
  • Online: 2014-07-01 Published: 2014-07-02

Abstract: Entity resolution finds and clusters records that refer to the same real-world object. Purely machine-based algorithms are efficient, but their accuracy is hard to guarantee. This paper proposes a hybrid approach to entity resolution that combines machine processing with crowdsourcing. First, a MapReduce-based parallel computing framework excludes the record pairs that cannot match, which reduces the number of human intelligence tasks; the remaining ambiguous record pairs are then labeled by human workers. To protect privacy during crowdsourcing, a role-based access control model and a strategy for hiding important information are adopted. The approach and the model are applied to a master patient index building platform for a hospital. Experimental results show that the hybrid approach exploits the strengths of both machine and human processing, resolves patient entities with high efficiency and accuracy, and effectively avoids leakage of patient information.

Key words: entity resolution, crowdsourcing, MapReduce programming model, privacy protection, master patient index
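
The two ideas in the abstract, machine-side filtering of record pairs and role-based hiding of sensitive fields, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no algorithmic details, so the blocking key (birth date), the similarity measure (Python's `difflib.SequenceMatcher`), the thresholds `LOW` and `HIGH`, the record fields, and the role names are all assumptions made for illustration. The map and reduce phases are simulated in plain Python rather than run on a MapReduce cluster.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical patient records; field names and values are illustrative only.
RECORDS = [
    {"id": 1, "name": "Zhang Wei", "birth": "1980-02-11", "phone": "13800001111"},
    {"id": 2, "name": "Zhang Wei", "birth": "1980-02-11", "phone": "13800001111"},
    {"id": 3, "name": "Zhang Wai", "birth": "1980-02-11", "phone": "13900002222"},
    {"id": 4, "name": "Li Na", "birth": "1975-09-30", "phone": "13700003333"},
]

# Assumed thresholds: below LOW is a non-match, at or above HIGH is a match,
# anything in between is ambiguous and would become a crowdsourcing task.
LOW, HIGH = 0.6, 0.95

# Assumed role-to-visible-fields mapping (role-based access control).
ROLE_VISIBLE = {
    "crowd_worker": {"name"},
    "hospital_admin": {"name", "birth", "phone"},
}


def map_phase(record):
    """Emit (blocking key, record). Records with different keys are never paired."""
    yield record["birth"], record


def similarity(a, b):
    """Average string similarity over the compared fields (an assumed measure)."""
    fields = ("name", "phone")
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)


def reduce_phase(group):
    """Compare all pairs inside one block and label each pair."""
    for a, b in combinations(group, 2):
        score = similarity(a, b)
        if score >= HIGH:
            label = "match"
        elif score >= LOW:
            label = "ambiguous"
        else:
            label = "non-match"
        yield (a["id"], b["id"]), label


def run(records):
    """Simulate the shuffle step: group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    labels = {}
    for group in groups.values():
        labels.update(reduce_phase(group))
    return labels


def mask(value):
    """Hide everything except the first and last character."""
    if len(value) <= 2:
        return "*" * len(value)
    return value[0] + "*" * (len(value) - 2) + value[-1]


def view_for(role, record):
    """Return the record as a given role may see it; unknown roles see only masks."""
    visible = ROLE_VISIBLE.get(role, set())
    return {
        key: value if key == "id" or key in visible else mask(str(value))
        for key, value in record.items()
    }


if __name__ == "__main__":
    labels = run(RECORDS)
    crowd_tasks = [pair for pair, label in labels.items() if label == "ambiguous"]
    by_id = {r["id"]: r for r in RECORDS}
    for left, right in crowd_tasks:
        print(view_for("crowd_worker", by_id[left]), view_for("crowd_worker", by_id[right]))
```

With these sample values, record 4 never shares a block with the others, so no pair involving it is generated, and only the pairs labeled ambiguous are shown to crowd workers, with birth date and phone number masked. Which fields a worker really needs to see in order to judge a pair is a design decision the abstract does not specify.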