计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2019, Vol. 13, Issue 7: 1154-1164. DOI: 10.3778/j.issn.1673-9418.1806013

• Artificial Intelligence •

Exploiting Katz Method to Boost Inductive Matrix Completion for Predicting Gene-Disease Associations

PU Jianyu1, CHEN Lei1,2,3+, SHAO Kai1

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing 210023, China
  3. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  • Online: 2019-07-01 Published: 2019-07-08

Abstract: Predicting gene-disease associations has become a focus of current biomedical research. Most existing prediction methods suffer from the sparsity of known gene-disease associations and from the PU (positive and unlabeled) problem. To address these shortcomings, this paper proposes KIMC (Katz method to boost inductive matrix completion), a model for predicting gene-disease associations. The model consists of two steps: a pre-estimation based on the Katz method and a refined estimation based on inductive matrix completion (IMC). It first applies the Katz method to a gene-disease heterogeneous network to pre-estimate gene-disease associations, which alleviates the effects of association sparsity and the PU problem. However, because the Katz method is limited by the quality of the similarity networks, this pre-estimation inevitably introduces some noise. The paper therefore introduces elastic-net regularization into the conventional IMC model to make it more robust, and uses the improved IMC model to refine the predicted associations. Experimental results on real datasets show that the proposed model achieves significantly higher precision and recall than several state-of-the-art gene-disease association prediction methods, and that it can also handle the cold-start problem common in association prediction.

Key words: gene-disease association prediction, matrix completion, heterogeneous information networks, elastic-net regularization, biomedical information processing
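
The two-step structure described in the abstract (a Katz pre-estimation followed by an elastic-net regularized IMC refinement) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and the abstract does not give the exact formulation, so several choices here are assumptions: the heterogeneous network is taken to be the block matrix [[G, P], [Pᵀ, D]] built from hypothetical gene and disease similarity matrices `G` and `D` and the known association matrix `P`; the Katz sum is truncated at `k` steps with decay `beta`; the elastic-net penalty is applied entrywise to a single parameter matrix `Z`; and the solver is plain proximal gradient descent. The feature matrices `X` and `Y` and all toy data are likewise hypothetical.

```python
import numpy as np


def katz_preestimate(G, D, P, beta=0.01, k=3):
    """Truncated Katz scores sum_{l=1..k} beta^l C^l on C = [[G, P], [P.T, D]].

    Returns the gene-by-disease block of the score matrix.
    (Assumed form; beta and k are illustrative defaults.)
    """
    n_genes = G.shape[0]
    C = np.block([[G, P], [P.T, D]])
    power = np.eye(C.shape[0])
    S = np.zeros_like(C, dtype=float)
    for _ in range(k):
        power = beta * (power @ C)  # now holds beta^l * C^l
        S += power
    return S[:n_genes, n_genes:]


def soft_threshold(A, t):
    """Proximal operator of t * ||A||_1 (entrywise shrinkage)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)


def objective(R, X, Y, Z, lam1, lam2):
    """0.5*||X Z Y^T - R||_F^2 + lam1*||Z||_1 + 0.5*lam2*||Z||_F^2."""
    resid = X @ Z @ Y.T - R
    return (0.5 * np.sum(resid ** 2)
            + lam1 * np.abs(Z).sum()
            + 0.5 * lam2 * np.sum(Z ** 2))


def elastic_net_imc(R, X, Y, lam1=0.01, lam2=0.1, n_iter=200):
    """Fit Z so that X @ Z @ Y.T approximates R, by proximal gradient descent.

    Assumed variant: entrywise elastic-net penalty on Z, fixed step 1/L.
    """
    Z = np.zeros((X.shape[1], Y.shape[1]))
    # Upper bound on the Lipschitz constant of the smooth part's gradient.
    L = np.linalg.norm(X, 2) ** 2 * np.linalg.norm(Y, 2) ** 2 + lam2
    step = 1.0 / L
    for _ in range(n_iter):
        grad = X.T @ (X @ Z @ Y.T - R) @ Y + lam2 * Z
        Z = soft_threshold(Z - step * grad, step * lam1)
    return Z


# Toy data (entirely synthetic, for illustration only).
rng = np.random.default_rng(0)
n_genes, n_diseases = 6, 4
G = rng.random((n_genes, n_genes))
G = (G + G.T) / 2            # symmetric gene similarity
np.fill_diagonal(G, 0.0)
D = rng.random((n_diseases, n_diseases))
D = (D + D.T) / 2            # symmetric disease similarity
np.fill_diagonal(D, 0.0)
P = (rng.random((n_genes, n_diseases)) < 0.2).astype(float)  # sparse known links
X = rng.random((n_genes, 3))      # hypothetical gene features
Y = rng.random((n_diseases, 2))   # hypothetical disease features

R = katz_preestimate(G, D, P)     # step 1: dense pre-estimate
Z = elastic_net_imc(R, X, Y)      # step 2: regularized refinement
scores = X @ Z @ Y.T              # predicted association scores
print(np.round(scores, 4))
```

Because the refined scores are computed from side features through `X @ Z @ Y.T`, a gene or disease with no known associations still receives a score, which is the sense in which an inductive model of this kind can address cold start. The step size 1/L is a conservative choice that keeps the objective from increasing between iterations.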