计算机科学与探索

• 学术研究 •

基于全域跨语义融合的多级酶功能预测

周汉文,邓赵红,张炜

  1. 江南大学 人工智能与计算机学院,江苏 无锡 214122

Global and Cross-Semantic Aggregation for Multi-level Enzyme Function Prediction

ZHOU Hanwen, DENG Zhaohong, ZHANG Wei

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China

摘要: 蛋白质在生物活动中发挥着关键作用,酶作为一种重要的蛋白质,因其催化功能在多个领域得到广泛应用。然而,通过生化实验验证酶的功能既费时又昂贵。传统的酶功能注释方法主要依赖于序列相似性,但在目标酶序列与已知酶差异较大时,这些方法效果不佳。近年来,科研人员初步探索了一些基于深度学习的方法,但现有的深度学习方法受限于传统酶序列编码方式,并且仅利用单一视图或单层次的信息,这使得模型在处理结构复杂或功能多样的酶时表现出一定的局限性。针对此,本文提出了一种全新的全域跨语义融合的多级酶功能预测方法(GCMEFP)。具体地,所提方法使用了两种最新的蛋白质大语言模型进行序列词嵌入学习。同时,所提方法构建了多语义深度特征学习模块,该模块通过卷积神经网络构建语义金字塔,实现了不同层级语义信息的提取。进一步地,所提方法还提出了全域跨视图语义融合模块,用于探索不同视图之间隐藏的相互作用信息,并去除冗余信息来增强模型的泛化性。实验结果表明:提出的GCMEFP在基准数据集上的精度达到89.6%,较现有最优方法高出4.8%;在独立测试集New-379上的精度达到55%,较现有最优方法高出14%。

关键词: 多级酶功能预测、多语义深度特征学习、大模型词嵌入、多视图特征融合

Abstract: Proteins play a crucial role in biological activities, and enzymes, an important class of proteins, are widely used across many fields because of their catalytic functions. However, verifying enzyme function through biochemical experiments is both time-consuming and expensive. Traditional enzyme function annotation methods rely mainly on sequence similarity and perform poorly when the target enzyme sequence differs substantially from known enzymes. Researchers have recently begun to explore deep learning-based methods, but existing approaches are constrained by traditional enzyme sequence encodings and use only a single view or a single level of information, which limits them on enzymes with complex structures or diverse functions. To address these issues, this paper proposes a novel multi-level enzyme function prediction method based on global cross-semantic fusion (GCMEFP). Specifically, the method uses two recent protein large language models to learn sequence word embeddings. It also constructs a multi-semantic deep feature learning module, which builds a semantic pyramid with convolutional neural networks to extract semantic information at different levels. Furthermore, it introduces a global cross-view semantic fusion module that explores hidden interaction information between different views and removes redundant information to improve the model's generalization. Experimental results show that GCMEFP achieves an accuracy of 89.6% on the benchmark dataset, 4.8% higher than the best existing method, and an accuracy of 55% on the independent test set New-379, 14% higher than the best existing method.

Key words: multi-level enzyme function prediction, multi-semantic deep feature learning, large language model word embedding, multi-view feature fusion
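
The pipeline outlined in the abstract (two embedding views, a semantic pyramid per view, then cross-view fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the sketch assumes that uniform average pooling stands in for the learned convolutions, that scaled dot-product attention stands in for the fusion module, and that both views have already been projected to the same embedding dimension. All function names (`conv1d_mean`, `semantic_pyramid`, `cross_attend`, `fuse_views`) are hypothetical, and random vectors replace real protein language model embeddings.

```python
import math
import random


def conv1d_mean(x, kernel, stride):
    """Toy stand-in for a 1D convolution: average each window of `kernel` positions."""
    out = []
    for start in range(0, len(x) - kernel + 1, stride):
        window = x[start:start + kernel]
        out.append([sum(col) / kernel for col in zip(*window)])
    return out


def semantic_pyramid(x, levels=3):
    """Return one summary vector per pyramid level, from fine to coarse."""
    feats = []
    current = x
    for _ in range(levels):
        # Mean over sequence positions summarizes the current level.
        feats.append([sum(col) / len(current) for col in zip(*current)])
        # Halve the sequence length for the next, coarser level.
        if len(current) >= 2:
            current = conv1d_mean(current, kernel=2, stride=2)
    return feats


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query vector attends over the vectors of the other view."""
    d = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        fused.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                      for j in range(d)])
    return fused


def fuse_views(view_a, view_b, levels=3):
    """Build a pyramid per view, attend in both directions, and concatenate."""
    pyr_a = semantic_pyramid(view_a, levels)
    pyr_b = semantic_pyramid(view_b, levels)
    a_to_b = cross_attend(pyr_a, pyr_b)
    b_to_a = cross_attend(pyr_b, pyr_a)
    return [v for level in a_to_b + b_to_a for v in level]


# Random per-residue embeddings stand in for the two language model views.
random.seed(0)
seq_len, dim = 8, 4
view_a = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
view_b = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

features = fuse_views(view_a, view_b)
print(len(features))  # 2 directions * 3 levels * 4 dims = 24
```

In a trained model the fused feature vector would feed a classifier for each level of the enzyme function hierarchy; that stage is omitted here because the abstract does not describe it.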