Global and Cross-Semantic Aggregation for Multi-level Enzyme Function Prediction

doi:10.3778/j.issn.1673-9418.2407060

Abstract

Proteins play a crucial role in biological activity, and enzymes, an important class of proteins, are widely used across many fields because of their catalytic functions. However, verifying enzyme function through biochemical experiments is both time-consuming and expensive. Traditional enzyme function annotation methods rely primarily on sequence similarity, which is ineffective when the target enzyme sequence differs significantly from known enzymes. Researchers have recently begun to explore deep learning-based methods, but these approaches are constrained by traditional enzyme sequence encodings and typically use only a single view or a single level of information. As a result, such models are limited when handling enzymes with complex structures or diverse functions. To address these challenges, this paper proposes a global and cross-semantic aggregation method for multi-level enzyme function prediction (GCMEFP). First, the method uses two state-of-the-art protein large language models to learn sequence word embeddings. Second, it constructs a multi-semantic deep feature learning module that uses a convolutional neural network to build a semantic pyramid, extracting semantic information at different levels. Third, it introduces a full-domain cross-view semantic fusion module that explores hidden interaction information between views and removes redundant information, thereby improving the model's generalization. Experimental results show that GCMEFP achieves 89.6% accuracy on the benchmark dataset, 0.048 (4.8 percentage points) higher than the best existing method, and 55.6% accuracy on the independent test set New-379, 0.14 (14 percentage points) higher than the best existing method.

Key words: multi-level enzyme function prediction, multi-semantic deep feature learning, large language model word embedding, multi-view feature aggregation
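
The data flow described in the abstract (two embedding views, a convolutional semantic pyramid per view, then cross-view fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the random vectors stand in for protein language model embeddings, both views are assumed to have already been projected to the same dimension, the fixed smoothing kernels replace learned convolution weights, and scaled dot-product attention is swapped in as one plausible form of cross-view fusion. All function names are hypothetical.

```python
import math
import random

def conv_level(seq, kernel, stride=2):
    """Depthwise 1-D convolution (one kernel shared by all channels) plus ReLU.
    A stride of 2 halves the sequence length, giving a coarser pyramid level."""
    k, dim = len(kernel), len(seq[0])
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        out.append([max(0.0, sum(w * row[c] for w, row in zip(kernel, window)))
                    for c in range(dim)])
    return out

def mean_pool(seq):
    """Average over the sequence axis to get one vector per level."""
    return [sum(row[c] for row in seq) / len(seq) for c in range(len(seq[0]))]

def semantic_pyramid(seq, kernels):
    """Return one pooled vector per level; level 0 is the raw embedding."""
    levels = [mean_pool(seq)]
    for kernel in kernels:
        seq = conv_level(seq, kernel)
        levels.append(mean_pool(seq))
    return levels

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Each level of one view attends over all levels of the other view."""
    d = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(d)
                  for kv in keys_values]
        weights = softmax(scores)
        fused.append([sum(w * kv[c] for w, kv in zip(weights, keys_values))
                      for c in range(d)])
    return fused

def fuse_views(view_a, view_b, kernels):
    """Build a pyramid per view, exchange information across views, concatenate."""
    pyr_a = semantic_pyramid(view_a, kernels)
    pyr_b = semantic_pyramid(view_b, kernels)
    fused = cross_attention(pyr_a, pyr_b) + cross_attention(pyr_b, pyr_a)
    return [x for vec in fused for x in vec]

# Placeholder inputs: random vectors standing in for two embedding views.
rng = random.Random(0)
seq_len, dim = 32, 8
view_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
view_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
kernels = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]  # two conv levels
features = fuse_views(view_a, view_b, kernels)
print(len(features))
```

In a full model, the concatenated feature vector would feed a classifier over enzyme function labels at each level; that stage is omitted here because the abstract does not specify it.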


ZHOU Hanwen, DENG Zhaohong, ZHANG Wei. Global and Cross-Semantic Aggregation for Multi-level Enzyme Function Prediction[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(6): 1588-1597.

