Journal of Frontiers of Computer Science and Technology, 2025, Vol. 19, Issue (7): 1969-1980. DOI: 10.3778/j.issn.1673-9418.2409052

• Practice · Applications •

TPLADD: Highly Robust and Precise C/C++ Third-Party Libraries Detection Method

JIA Yunfeng, WANG Junfeng, WU Peng   

  1. School of Computer Science (School of Software, School of Intelligent Science and Technology), Sichuan University, Chengdu 610065, China
  2. School of Information and Engineering, Sichuan Tourism University, Chengdu 610100, China
  • Online: 2025-07-01 Published: 2025-06-30

TPLADD:高鲁棒性与高精度的C/C++第三方库检测方法

贾昀峰,王俊峰,吴鹏   

  1. 四川大学 计算机学院(软件学院、智能科学与技术学院),成都 610065
  2. 四川旅游学院 信息与工程学院,成都 610100

Abstract: Third-party libraries (TPLs) are a crucial component of modern C/C++ software development, and detecting and managing them precisely is essential to software quality and security. However, existing methods rely primarily on code syntax features, which adapt poorly to Type Ⅱ and Type Ⅲ clone reuse and often cause detection failures. This paper proposes TPLADD (third-party library approximate detection with deduplication), a TPL detection method based on function abstract syntax tree (AST) features. The method uses the degree and order metrics of AST nodes to embed function syntax as vectors quickly, and combines a vector database with approximate nearest neighbor search, which significantly improves detection robustness under modified reuse. In addition, an anomaly-detection-based filtering technique reduces the impact of interference functions and improves precision. A feature vector index library is built from 29,782 open-source software (OSS) projects, comprising 726,074 versions, collected from GitHub, and the method is validated on 100 well-known projects. Experimental results show that, in accuracy, TPLADD improves on CENTRIS by 3.88 percentage points in precision and 2.76 percentage points in recall; in robustness, TPLADD maintains an F1 score of 74% even under substantial code modification; in performance, the average detection time per TPL is only 0.42 s, and the index library occupies only 0.41% of the storage of the total function features. These results demonstrate the high robustness and precision of TPLADD, along with good performance.

Key words: open-source software, software composition analysis, third-party library detection, code cloning, modified reuse, static analysis
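
The core idea in the abstract, embedding a function's AST by node degree and order and then matching by vector similarity, can be illustrated with a minimal Python sketch. This is not the TPLADD implementation: the toy AST encoding `(label, children)`, the weighting scheme (a degree histogram weighted by inverse preorder position), the names `embed`, `cosine`, and `nearest`, and the brute-force search standing in for an approximate nearest neighbor index are all assumptions made for illustration.

```python
import math

# Hypothetical toy AST: each node is (label, [children]).
# Assumption: the embedding is a histogram over node out-degree,
# weighted by 1 / preorder position. The paper's actual metrics
# are not specified in the abstract.
def embed(ast, dims=8):
    vec = [0.0] * dims
    order = 0
    stack = [ast]
    while stack:
        _label, children = stack.pop()   # labels are ignored on purpose
        order += 1                       # preorder visit index, starting at 1
        degree = min(len(children), dims - 1)
        vec[degree] += 1.0 / order
        stack.extend(reversed(children)) # keep left-to-right preorder
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]       # unit length

def cosine(a, b):
    # Inputs are already unit vectors, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def nearest(query, index):
    # Brute-force stand-in for an approximate nearest neighbor lookup.
    name = max(index, key=lambda k: cosine(query, index[k]))
    return name, cosine(query, index[name])

# Two indexed library functions with different shapes.
lib_a = ("func", [("if", [("cond", []), ("ret", [])]), ("ret", [])])
lib_b = ("func", [("loop", [("call", [("arg", [])])])])
index = {"libA": embed(lib_a), "libB": embed(lib_b)}

# Renamed copy of lib_a (identifier changes only, same structure).
renamed = ("f2", [("if", [("c", []), ("r", [])]), ("r", [])])
# Modified copy of lib_a (one statement added inside the if-branch).
modified = ("func", [("if", [("cond", []), ("call", []), ("ret", [])]), ("ret", [])])

print(nearest(embed(renamed), index))
print(nearest(embed(modified), index))
```

Because the sketch ignores node labels, a renamed copy maps to the same vector as the original, while a copy with an added statement maps to a nearby vector and still resolves to the same library entry. A real system would replace the linear scan with a vector database index.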

摘要: 第三方库(TPL)作为现代C/C++软件开发的重要组成部分,其精确检测与管理对于保障软件质量与安全性至关重要。然而现有方法主要依赖代码语法特征,对Type Ⅱ和Type Ⅲ克隆重用场景的适应性不足,易导致检测失效。提出一种基于函数抽象语法树(AST)特征的TPL检测方法TPLADD。该方法利用AST节点度数与次序的度量信息快速实现函数语法向量嵌入,并结合向量数据库与近似最近邻检索技术,显著提升了修改重用场景下的检测鲁棒性。基于异常检测的过滤技术可以有效减少干扰函数对检测的影响,提高结果精确性。基于GitHub搜集的29 782个开源软件(OSS)共计726 074个版本,构建了特征向量索引库,并在100个知名项目上验证有效性。实验结果表明,在精度上,TPLADD相较于CENTRIS,精确率和召回率分别提升了3.88和2.76个百分点;在鲁棒性上,TPLADD即使出现较大程度代码修改时,仍能保持74%的F1值;在性能上,TPLADD平均每个TPL检测耗时仅0.42 s,索引库存储占用率仅为总体函数特征的0.41%。这些充分体现了TPLADD高鲁棒性、高精确性的特点,且具备良好的性能表现。

关键词: 开源软件, 软件组件分析, 第三方库检测, 代码克隆, 修改重用, 静态分析