Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2020, Vol. 14, Issue (6): 958-965. DOI: 10.3778/j.issn.1673-9418.1906030

• Network and Information Security •

Message Clustering Method for Private Binary Protocols

XU Xudong (徐旭东), ZHANG Zhixiang (张志祥), ZHANG Xian (张献)

  1. College of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China
  • Online: 2020-06-01 Published: 2020-06-04

Abstract:

Message clustering is one of the main steps of protocol reverse engineering. For private binary protocol messages, existing clustering methods produce redundant features when vectorizing messages, and traditional clustering algorithms make it hard to choose the cluster centers and the number of clusters. Following the idea of n-gram serialization, the proposed method builds a sequence item-position matrix for the messages, mines frequent items from it, and constructs message feature vectors from those items, which removes sequence noise from the vectorization. The silhouette coefficient then guides a divisive hierarchical clustering, which avoids choosing an initial number of clusters or initial cluster centers and so allows private binary protocol messages to be clustered without supervision. The method is tested on a data set containing seven types of messages from four protocols: AIS, DNS, ICMP, and ARP. Observed through t-SNE visualization of the message distribution, the feature vectorization yields a good distribution and expresses message features well. Compared with traditional clustering methods, the silhouette-guided divisive hierarchical clustering achieves a clear improvement in purity and F1 score.
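The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual parameters or procedures, so everything specific here is an assumption made for illustration: the function names, the n-gram length `n=2`, the `min_support` threshold, the Hamming distance, the farthest-pair bisection, and the rule of accepting a split only when the mean silhouette coefficient improves. The messages are synthetic toy data, not the paper's data set.

```python
from collections import Counter
from itertools import combinations


def position_items(msg: bytes, n: int = 2) -> set:
    """Return the (offset, n-gram) items found in one message."""
    return {(i, msg[i:i + n]) for i in range(len(msg) - n + 1)}


def frequent_items(messages, n=2, min_support=0.3):
    """Keep items present in at least min_support of the messages (assumed rule)."""
    counts = Counter(item for m in messages for item in position_items(m, n))
    threshold = min_support * len(messages)
    return sorted(item for item, c in counts.items() if c >= threshold)


def vectorize(messages, items, n=2):
    """Binary feature vector: 1 if the message contains the frequent item."""
    return [[int(it in position_items(m, n)) for it in items] for m in messages]


def hamming(a, b):
    """Number of positions where two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def silhouette(vectors, labels):
    """Mean silhouette coefficient; 0.0 when there is only one cluster."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton clusters contribute 0 by convention
        a = sum(hamming(vectors[i], vectors[j]) for j in own) / len(own)
        b = min(
            sum(hamming(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def bisect(vectors, members):
    """Split one cluster around its two most distant members (assumed heuristic)."""
    p, q = max(combinations(members, 2),
               key=lambda pair: hamming(vectors[pair[0]], vectors[pair[1]]))
    left = [i for i in members
            if hamming(vectors[i], vectors[p]) <= hamming(vectors[i], vectors[q])]
    right = [i for i in members if i not in left]
    return left, right


def divisive_clustering(vectors):
    """Keep splitting clusters while the mean silhouette coefficient improves."""
    labels = [0] * len(vectors)
    best = silhouette(vectors, labels)
    while True:
        candidates = []
        for lab in set(labels):
            members = [i for i, l in enumerate(labels) if l == lab]
            if len(members) < 2:
                continue
            _, right = bisect(vectors, members)
            trial = labels[:]
            new_label = max(labels) + 1
            for i in right:
                trial[i] = new_label
            candidates.append((silhouette(vectors, trial), trial))
        if not candidates:
            return labels
        score, trial = max(candidates, key=lambda c: c[0])
        if score <= best:
            return labels  # no split improves the silhouette: stop
        best, labels = score, trial


# Synthetic toy messages: two made-up formats with one varying byte each.
messages = [bytes([0x01, 0x00, k, 0xAA, 0xBB]) for k in range(4)] + \
           [bytes([0x02, 0xFF, k, 0xCC, 0xDD]) for k in range(4)]
items = frequent_items(messages)
vectors = vectorize(messages, items)
labels = divisive_clustering(vectors)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this toy run the varying byte produces n-grams that each occur in only one message, so the support filter drops them and only the four stable header and trailer items remain as features. The number of clusters is never supplied; the loop stops on its own once no further split raises the silhouette coefficient.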

Key words: binary protocol, message clustering, feature vector generation, divisive hierarchical clustering, frequent item mining