Journal of Frontiers of Computer Science and Technology ›› 2019, Vol. 13 ›› Issue (12): 2073-2084. DOI: 10.3778/j.issn.1673-9418.1901009

• Network and Information Security •

Application Research of Autoencoder Network in Malicious JavaScript Code Detection

LONG Tingyan (龙廷艳), WAN Liang (万良), DING Hongwei (丁红卫)

  1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  2. Institute of Computer Software and Theory, Guizhou University, Guiyang 550025, China
  • Online: 2019-12-01 Published: 2019-12-10

Abstract: Traditional machine learning feature extraction methods struggle to uncover the deep, essential features of JavaScript malicious code. To address this, a JavaScript malicious code detection method based on a stacked sparse denoising autoencoder network (sSDAN) is proposed. First, the JavaScript malicious code is converted to a numerical representation. A sparsity constraint is then added to the autoencoder network, and the network is trained on inputs corrupted with noise drawn from a given probability distribution, so that the autoencoder model learns feature representations of the data at different levels. Next, unsupervised greedy layer-wise pre-training followed by supervised fine-tuning yields deeper, effectively denoised features. Finally, a Softmax function is used to classify the features. Experimental results show that the sparse denoising autoencoder classification algorithm classifies JavaScript well and is more accurate than traditional machine learning models: its accuracy is 0.717% higher than that of the random forest method and 2.237% higher than that of the support vector machine (SVM) method.

Key words: stacked sparse denoising autoencoder network (sSDAN), JavaScript malicious code, machine learning
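
The building blocks named in the abstract (numerical encoding, noise corruption, a sparsity constraint, and Softmax classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how scripts are encoded, which noise distribution is used, or how sparsity is enforced. The byte-value encoding, the masking noise, the KL-divergence penalty with target activation `rho`, and all function names below are assumptions chosen only for illustration.

```python
import math
import random


def encode_script(source, length=16):
    """Assumed encoding: UTF-8 byte values scaled to [0, 1], zero-padded to `length`."""
    codes = [b / 255.0 for b in source.encode("utf-8")[:length]]
    return codes + [0.0] * (length - len(codes))


def corrupt(vector, drop_prob, rng):
    """Assumed noise model: zero each component independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in vector]


def kl_sparsity(mean_activations, rho=0.05):
    """Assumed sparsity term: sum over hidden units of KL(rho || rho_hat_j)."""
    eps = 1e-8
    total = 0.0
    for rho_hat in mean_activations:
        # Clamp to keep the logarithms finite.
        rho_hat = min(max(rho_hat, eps), 1.0 - eps)
        total += rho * math.log(rho / rho_hat)
        total += (1 - rho) * math.log((1 - rho) / (1 - rho_hat))
    return total


def softmax(logits):
    """Numerically stable Softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    x = encode_script("eval(unescape('%61'))")
    x_noisy = corrupt(x, drop_prob=0.3, rng=rng)
    print(len(x), len(x_noisy))
    print(kl_sparsity([0.05, 0.5]))
    print(softmax([2.0, 0.5]))
```

In a full denoising autoencoder, the network would be trained to reconstruct `x` from `x_noisy`, with the sparsity term added to the reconstruction loss. Stacking several such layers and fine-tuning them with a Softmax output layer would give the overall pipeline the abstract describes.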