Journal of Frontiers of Computer Science and Technology (计算机科学与探索)

• Academic Research •

TPLADD: A Highly Robust and Precise C/C++ Third-Party Libraries Detection Method

JIA Yunfeng (贾昀峰), WANG Junfeng (王俊峰), WU Peng (吴鹏)

  1. Sichuan University, School of Computer Science (School of Software, School of Intelligent Science and Technology), Chengdu 610065, China
  2. Sichuan Tourism University, School of Information and Engineering, Chengdu 610100, China

Abstract: Third-party libraries (TPLs) are an essential part of modern C/C++ software development, and detecting and managing them precisely is critical to software quality and security. However, existing methods rely mainly on syntactic code features and adapt poorly to Type II and Type III clone reuse, which often causes detection to fail. To address this, we propose TPLADD, a TPL detection method based on features of function abstract syntax trees (ASTs). TPLADD uses degree and order metrics of AST nodes to embed function syntax into vectors quickly, and combines a vector database with approximate nearest neighbor (ANN) search, which significantly improves detection robustness under modified reuse. In addition, a filtering technique based on anomaly detection reduces the influence of interfering functions and improves the precision of the results. We built a feature vector index library from 29,782 open-source software (OSS) projects collected from GitHub, totaling 726,074 versions, and validated the method on 100 well-known projects. In terms of accuracy, TPLADD improves precision and recall over CENTRIS by 3.88% and 2.76%, respectively. In terms of robustness, TPLADD maintains an F1 score of 74% even under substantial code modification. In terms of performance, TPLADD takes only 0.42 s on average to detect each TPL, and the index library occupies only 0.41% of the storage of the total function features. These results show that TPLADD is highly robust and precise while also performing well.

Key words: open-source software, software composition analysis, third-party library detection, code cloning, modified reuse, static analysis
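
The abstract names the ingredients of the approach (degree and order metrics of AST nodes, vector embedding, nearest neighbor retrieval) but does not spell out the construction. The following Python sketch is only an illustration of the general idea under stated assumptions, not TPLADD's implementation: it assumes a function's AST is given as nested `(label, children)` tuples, takes "degree" to mean a node's number of children and "order" to mean its position among its siblings, and builds a normalized histogram over (degree, order) pairs. The names `embed`, `cosine`, `nearest`, the bucket sizes, and the toy library entries are all hypothetical, and an exact linear scan stands in for the vector database and ANN search.

```python
import math


def embed(ast, max_degree=4, max_order=4):
    """Embed an AST as an L2-normalized histogram of (degree, order) pairs.

    Assumption for illustration: degree = number of children of a node,
    order = index of the node among its siblings. Node labels are ignored.
    """
    vec = [0.0] * (max_degree * max_order)
    stack = [(ast, 0)]  # (node, index among its siblings); the root gets 0
    while stack:
        (_label, children), order = stack.pop()
        d = min(len(children), max_degree - 1)  # clamp into the last bucket
        o = min(order, max_order - 1)
        vec[d * max_order + o] += 1.0
        for i, child in enumerate(children):
            stack.append((child, i))
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(a * b for a, b in zip(u, v))


def nearest(query, index):
    """Return the (name, vector) entry most similar to the query.

    Exact linear scan; a hypothetical stand-in for an ANN index.
    """
    return max(index.items(), key=lambda kv: cosine(query, kv[1]))


def leaf(name):
    return (name, [])


# Toy AST for: int add(int a, int b) { return a + b; }
original = ("func", [("params", [leaf("a"), leaf("b")]),
                     ("body", [("return", [("+", [leaf("a"), leaf("b")])])])])

# Renamed identifiers only (a Type II style change).
renamed = ("func", [("params", [leaf("x"), leaf("y")]),
                    ("body", [("return", [("+", [leaf("x"), leaf("y")])])])])

# Renamed identifiers plus one inserted statement (a Type III style change).
modified = ("func", [("params", [leaf("x"), leaf("y")]),
                     ("body", [("call", [leaf("log")]),
                               ("return", [("+", [leaf("x"), leaf("y")])])])])

# A structurally different function.
unrelated = ("func", [("params", []),
                      ("body", [("loop", [leaf("i"), leaf("n"),
                                          ("call", [leaf("f"), leaf("i"),
                                                    leaf("n")])])])])

# Hypothetical index mapping "library::function" names to embeddings.
index = {"libadd::add": embed(original), "libloop::run": embed(unrelated)}

best_name, best_vec = nearest(embed(modified), index)
print(best_name, round(cosine(embed(modified), best_vec), 4))
```

Because labels are ignored, renaming identifiers leaves the vector unchanged, while inserting a statement shifts only a few histogram buckets, so the modified function still retrieves its original as the closest entry. A histogram this coarse would collide often on real code; it is meant only to show why structure-based vectors tolerate small edits better than exact syntactic matching.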