Journal of Frontiers of Computer Science and Technology

• Academic Research •

Multistage Learning for SBERT Word-Level Adversarial Sample Detection

CHANG Jian, ZHANG Hui, JIN Haibo, WANG Bingbing

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China

Abstract: Adversarial samples are generated by introducing subtle perturbations, at the lexical or semantic level, into original samples, causing models to produce incorrect outputs with high confidence. Because these samples are highly similar to the originals in the embedding space, they are particularly difficult to detect. Furthermore, most language models are designed primarily for tasks such as text generation or classification rather than for producing high-quality sentence embeddings, which makes it hard to distinguish adversarial samples from normal ones. The problem is especially pronounced under complex word-level adversarial attacks, where subtle semantic differences often go unnoticed by the model and detection performance suffers. To address these limitations, an innovative multi-stage learning approach for sentence embedding models is proposed, which systematically optimizes the embedding space of the SBERT model to significantly sharpen the distinction between adversarial and normal samples. In the first stage, contrastive learning improves SBERT's ability to distinguish adversarial samples from normal ones, separating their representations in the embedding space. In the second stage, supervised contrastive learning combined with multi-level noise augmentation further refines the embedding space, promoting tighter clustering of same-class samples and maximizing the separation of different classes. In the third stage, a classifier maps the model's embedding vectors to labels. Experiments with BERT and Mamba as the attacked models, conducted on three classification datasets with various types of textual adversarial attacks, show that the proposed method outperforms existing methods in detecting adversarial samples. Moreover, it generalizes strongly across models, attacks, and datasets, providing a novel and effective approach to textual adversarial sample detection.

Key words: textual adversarial sample detection, SBERT, contrastive learning, sentence embedding models, noise augmentation, embedding similarity
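
The abstract outlines the three-stage method but not its exact formulations. As a rough illustration of stage 2 (supervised contrastive learning with multi-level noise augmentation over SBERT embeddings) and stage 3 (a classifier head), the following minimal PyTorch sketch assumes a SupCon-style loss, Gaussian noise at a few fixed scales, and the all-MiniLM-L6-v2 checkpoint; the helper names supcon_loss and multilevel_noise, the temperature, and the noise scales are illustrative assumptions, not the paper's implementation.

# Illustrative sketch only; see the caveats above.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

def supcon_loss(emb, labels, temperature=0.1):
    """SupCon-style supervised contrastive loss: pulls same-label
    embeddings together and pushes different-label embeddings apart."""
    z = F.normalize(emb, dim=1)                        # unit-norm embeddings
    sim = z @ z.t() / temperature                      # pairwise similarities
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_counts = pos.sum(dim=1)
    valid = pos_counts > 0                             # anchors with a positive
    per_anchor = -log_prob.masked_fill(~pos, 0.0).sum(dim=1)
    return (per_anchor[valid] / pos_counts[valid]).mean()

def multilevel_noise(emb, labels, sigmas=(0.01, 0.05, 0.1)):
    """Assumed form of multi-level noise augmentation: one Gaussian-perturbed
    copy of each embedding per noise scale, with labels repeated to match."""
    noisy = [emb + s * torch.randn_like(emb) for s in sigmas]
    return torch.cat([emb] + noisy), labels.repeat(len(sigmas) + 1)

sbert = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative checkpoint
sents = ["the film was wonderful", "the film was wonderfull",      # normal vs.
         "the service was terrible", "the service was terrib1e"]   # perturbed
labels = torch.tensor([0, 1, 0, 1])                    # 0 = normal, 1 = adversarial
emb = sbert.encode(sents, convert_to_tensor=True).cpu()  # SBERT embeddings (CPU)

aug_emb, aug_labels = multilevel_noise(emb, labels)    # stage 2: augment ...
loss = supcon_loss(aug_emb, aug_labels)                # ... and contrast

clf = torch.nn.Linear(emb.size(1), 2)                  # stage 3: embedding -> label
ce = F.cross_entropy(clf(emb), labels)

Note that encode() runs in inference mode, so the snippet only illustrates the loss computation on fixed embeddings; actual fine-tuning would backpropagate the contrastive loss through the SBERT encoder's forward pass.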