计算机科学与探索 ›› 2018, Vol. 12 ›› Issue (5): 794-803.DOI: 10.3778/j.issn.1673-9418.1705031

• 人工智能与模式识别 • 上一篇    下一篇

基于遗传算法的蛋白质复合物识别算法

郑文萍1,2,3+,李晋玉1,王  杰1,2   

  1. 1. 山西大学 计算机与信息技术学院,太原 030006
    2. 山西大学 计算智能与中文信息处理教育部重点实验室,太原 030006
    3. 山西大学 大数据挖掘与智能技术山西省协同创新技术中心,太原 030006
  • 出版日期:2018-05-01 发布日期:2018-05-07

Protein Complex Recognition Algorithm Based on Genetic Algorithm

ZHENG Wenping1,2,3+, LI Jinyu1, WANG Jie1,2   

  1. 1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
    2. Key Laboratory of Computation Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
    3. Collaborative Innovation Center of Shanxi Province for Big Data Mining and Intelligence Technology, Shanxi University, Taiyuan 030006, China
  • Online:2018-05-01 Published:2018-05-07

摘要: 蛋白质互作用网络是一种典型的复杂网络,呈现了明显的社区结构。网络中的社区对应于功能模块,通常被看作蛋白质复合物。蛋白质复合物识别对预测蛋白质功能,解释特定生物进程具有重要作用。基于种子节点扩展的图聚类方法在蛋白质复合物识别中应用广泛。针对此类算法最终结果受种子节点的影响较大,并且在簇的形成过程中搜索空间有限等问题,提出了一种基于遗传算法的蛋白质复合物识别算法GAGC(genetic algorithm based graph clustering),其中个体表示聚类结果(类别之间可能存在重叠节点),以F-measure值作为种群进化的目标函数。算法采用IPCA(improvement development clustering algorithm)算法产生初始种群;针对初始种群,设计了染色体对齐方式以进行交叉操作产生下一代种群。通过与DPClus、MCODE、IPCA、ClusterOne、HC-PIN、CFinder等经典算法的对比实验表明,GAGC算法能够扩大图聚类算法的搜索空间,提高解的多样性,进而提高蛋白质复合物检测的性能。

关键词: 蛋白质互作用网络, 遗传算法, 图聚类, 蛋白质复合物

Abstract: The protein interaction network is a typical complex network that presents a distinct community structure. The communities in the network correspond to functional modules and are often considered as protein complexes.    Protein complex recognition plays an important role in predicting protein function and explaining specific biological processes. The clustering method based on seed node expansion is widely used in protein complex recognition. Aiming at the problems that the final result of this algorithm is influenced by seed nodes, and the search space is limited in the process of cluster formation, this paper presents a protein complex recognition algorithm GAGC (genetic algorithm based graph clustering), where the individual represents the clustering result (there may be overlapping nodes between categories), and the F-measure value is used as the objective function of population evolution. The algorithm uses the IPCA (improvement development clustering algorithm) to generate the initial population; for the initial population, chromosome alignment is designed to cross the operation to produce the next generation population. Comparing with the classical algorithms such as DPClus, MCODE, IPCA, ClusterOne, HC-PIN and CFinder, GAGC algorithm can expand the search space of graph clustering algorithm, improve the diversity of solution, and then improve the performance of protein complex detection.

Key words: protein-protein interaction (PPI) network, genetic algorithm, graph clustering, protein complex