TPLADD: A Highly Robust and Precise C/C++ Third-Party Libraries Detection Method

doi:10.3778/j.issn.1673-9418.2409052

Abstract

Third-party libraries (TPLs) are an essential part of modern C/C++ software development, and detecting and managing them accurately is critical to software quality and security. Existing methods, however, rely mainly on code syntax features and adapt poorly to Type II and Type III clone reuse, which often causes detection to fail. This paper proposes TPLADD, a TPL detection method based on features of function abstract syntax trees (ASTs). TPLADD uses metrics of AST node degree and order to embed function syntax as vectors quickly, and combines a vector database with approximate nearest neighbor (ANN) search, which markedly improves detection robustness under modified reuse. In addition, a filtering technique based on anomaly detection reduces the influence of interfering functions and improves the precision of the results. A feature-vector index was built from 29,782 open-source software (OSS) projects collected from GitHub, totaling 726,074 versions, and the method was validated on 100 well-known projects. In terms of accuracy, TPLADD improves precision and recall over CENTRIS by 3.88% and 2.76%, respectively. In terms of robustness, TPLADD maintains an F1 score of 74% even under substantial code modification. In terms of performance, TPLADD takes 0.42 s on average to detect each TPL, and the index occupies only 0.41% of the storage required by the full set of function features. These results indicate high robustness and high precision together with good performance.

Key words: open-source software, software composition analysis, third-party library detection, code clone, modified reuse, static analysis
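The abstract describes the pipeline only at a high level, so the following Python sketch is an illustration under stated assumptions rather than the authors' implementation. It assumes that "degree" means a node's number of children and that "order" means a node's position among its siblings; the `Node` class, the histogram layout, the bucket limits, and the library names in the index are all hypothetical. A brute-force cosine search stands in for the vector database and ANN retrieval, and the anomaly-detection filter is not modeled.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    """Toy AST node (hypothetical); the label is deliberately ignored by embed()."""
    label: str
    children: list["Node"] = field(default_factory=list)


def t(label, *children):
    """Shorthand for building small trees."""
    return Node(label, list(children))


def embed(root, max_degree=4, max_order=4):
    """Unit-length histogram over (child count, sibling position) pairs.

    Assumption: this is one plausible reading of 'degree and order metrics',
    not the formula used in the paper.
    """
    width = max_order + 1
    vec = [0.0] * ((max_degree + 1) * width)
    stack = [(root, 0)]
    while stack:
        node, order = stack.pop()
        degree = min(len(node.children), max_degree)
        vec[degree * width + min(order, max_order)] += 1.0
        stack.extend((child, i) for i, child in enumerate(node.children))
    norm = sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest(query, index):
    """Exact search; a real system would query an ANN index instead."""
    return max(((name, cosine(query, v)) for name, v in index.items()),
               key=lambda pair: pair[1])


# A library function, a renamed copy (Type II style), and an edited copy (Type III style).
original = t("func", t("params", t("buf")),
             t("body", t("assign", t("n"), t("0")), t("return", t("n"))))
renamed = t("func", t("params", t("data")),
            t("body", t("assign", t("len"), t("0")), t("return", t("len"))))
modified = t("func", t("params", t("data")),
             t("body", t("assign", t("len"), t("0")), t("return", t("len")),
               t("call", t("log"))))
unrelated = t("func", t("body", t("return", t("0"))))

index = {"libA:parse": embed(original), "libB:noop": embed(unrelated)}

print(nearest(embed(renamed), index))
print(nearest(embed(modified), index))
```

Because the embedding depends only on tree shape, renaming identifiers leaves the vector unchanged, and adding one statement moves it only slightly, which is the kind of tolerance to modified reuse that the abstract attributes to the AST-based features.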

JIA Yunfeng, WANG Junfeng, WU Peng. TPLADD: A Highly Robust and Precise C/C++ Third-Party Libraries Detection Method[J]. Journal of Frontiers of Computer Science and Technology, DOI: 10.3778/j.issn.1673-9418.2409052.


