Journal of Frontiers of Computer Science and Technology

• Academic Research •

Multistage Learning for SBERT Word-Level Adversarial Sample Detection

CHANG Jian, ZHANG Hui, JIN Haibo, WANG Bingbing

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China

Abstract: Adversarial samples are generated by introducing subtle perturbations, at the lexical or semantic level, into original samples, causing models to produce incorrect outputs with high confidence. Because these samples are highly similar to the originals in the embedding space, they are particularly difficult to detect. Furthermore, most language models are designed primarily for tasks such as text generation or classification rather than for producing high-quality sentence embeddings, which makes it hard to distinguish adversarial samples from normal ones. The problem is especially pronounced under complex word-level adversarial attacks, where subtle semantic differences often go unnoticed by the model and detection performance suffers. To address these limitations, an innovative multi-stage learning approach for sentence embedding models is proposed, which systematically optimizes the embedding space of the SBERT model to significantly sharpen the distinction between adversarial and normal samples. In the first stage, contrastive learning improves SBERT's ability to distinguish adversarial samples from normal ones, separating their representations in the embedding space. In the second stage, supervised contrastive learning combined with multi-level noise augmentation further refines the embedding space, promoting tighter clustering of same-class samples and maximizing the separation of different classes. In the third stage, a classifier maps the model's embedding vectors to labels. Experiments with BERT and Mamba as the attacked models, conducted on three classification datasets with various types of textual adversarial attacks, show that the proposed method outperforms existing methods in detecting adversarial samples. Moreover, it generalizes strongly across models, attacks, and datasets, providing a novel and effective approach to textual adversarial sample detection.

Key words: textual adversarial sample detection, SBERT, contrastive learning, sentence embedding models, noise augmentation, embedding similarity
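
The abstract outlines the three-stage method but not its exact formulations. As a rough illustration of stage 2 (supervised contrastive learning with multi-level noise augmentation over SBERT embeddings) and stage 3 (a classifier head), the following minimal PyTorch sketch assumes a SupCon-style loss, Gaussian noise at a few fixed scales, and the all-MiniLM-L6-v2 checkpoint; the helper names supcon_loss and multilevel_noise, the temperature, and the noise scales are illustrative assumptions, not the paper's implementation.

# Illustrative sketch only; see the caveats above.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

def supcon_loss(emb, labels, temperature=0.1):
    """SupCon-style supervised contrastive loss: pulls same-label
    embeddings together and pushes different-label embeddings apart."""
    z = F.normalize(emb, dim=1)                        # unit-norm embeddings
    sim = z @ z.t() / temperature                      # pairwise similarities
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_counts = pos.sum(dim=1)
    valid = pos_counts > 0                             # anchors with a positive
    per_anchor = -log_prob.masked_fill(~pos, 0.0).sum(dim=1)
    return (per_anchor[valid] / pos_counts[valid]).mean()

def multilevel_noise(emb, labels, sigmas=(0.01, 0.05, 0.1)):
    """Assumed form of multi-level noise augmentation: one Gaussian-perturbed
    copy of each embedding per noise scale, with labels repeated to match."""
    noisy = [emb + s * torch.randn_like(emb) for s in sigmas]
    return torch.cat([emb] + noisy), labels.repeat(len(sigmas) + 1)

sbert = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative checkpoint
sents = ["the film was wonderful", "the film was wonderfull",      # normal vs.
         "the service was terrible", "the service was terrib1e"]   # perturbed
labels = torch.tensor([0, 1, 0, 1])                    # 0 = normal, 1 = adversarial
emb = sbert.encode(sents, convert_to_tensor=True).cpu()  # SBERT embeddings (CPU)

aug_emb, aug_labels = multilevel_noise(emb, labels)    # stage 2: augment ...
loss = supcon_loss(aug_emb, aug_labels)                # ... and contrast

clf = torch.nn.Linear(emb.size(1), 2)                  # stage 3: embedding -> label
ce = F.cross_entropy(clf(emb), labels)

Note that encode() runs in inference mode, so the snippet only illustrates the loss computation on fixed embeddings; actual fine-tuning would backpropagate the contrastive loss through the SBERT encoder's forward pass.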