Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (10): 2264-2272.DOI: 10.3778/j.issn.1673-9418.2103066

• High Performance Computing •

Research on Method of Log Pattern Extracting in High-Performance Computing Environment

WANG Xiaodong1,2, ZHAO Yining1,+, XIAO Haili1, WANG Xiaoning1, CHI Xuebin1,2

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2021-03-19 Revised:2021-05-20 Online:2022-10-01 Published:2021-05-31
  • About author: WANG Xiaodong, born in 1989, Ph.D. candidate. His research interests include log processing and analysis, data mining and machine learning.
    ZHAO Yining, born in 1983, research assistant. His research interests include log processing and analysis, and data mining.
    XIAO Haili, born in 1978, M.S., professor. His research interests include grid computing and distributed computing.
    WANG Xiaoning, born in 1981, Ph.D., associate professor. Her research interests include grid computing and distributed computing.
    CHI Xuebin, born in 1963, professor. His research interests include parallel computing and software, grid computing technology, high performance computer system maintenance and management.
  • Supported by:
    Strategic Priority Research Program of the Chinese Academy of Sciences(XDA19020101)

高性能计算环境中日志模式提炼方法的研究

王晓东1,2, 赵一宁1,+, 肖海力1, 王小宁1, 迟学斌1,2

  1. 中国科学院 计算机网络信息中心,北京 100190
    2.中国科学院大学,北京 100049
  • Corresponding author: + E-mail: zhaoyn@sccas.cn
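  • About the authors: WANG Xiaodong (born 1989), male, from Luoyang, Henan, Ph.D. candidate. His main research interests are log processing and analysis, data mining and machine learning.
    ZHAO Yining (born 1983), male, from Hejian, Hebei, research assistant. His main research interests are log processing and analysis, and data mining.
    XIAO Haili (born 1978), male, from Tianmen, Hubei, M.S., professor. His main research interests are grid computing and distributed computing.
    WANG Xiaoning (born 1981), female, from Ziyang, Sichuan, Ph.D., associate professor. Her main research interests are grid computing and distributed computing.
    CHI Xuebin (born 1963), male, from Meihekou, Jilin, professor. His main research interests are parallel computing and software, grid computing technology, and high-performance computer system maintenance and management.
  • Supported by:
    Strategic Priority Research Program of the Chinese Academy of Sciences (Category A) (XDA19020101)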

Abstract:

Log analysis plays an important role in keeping computer systems running stably. However, logs are usually unstructured, which hinders automated analysis, so automatically categorizing logs and converting them into structured data is of great practical significance. This paper proposes the LDmatch algorithm, a log pattern extraction algorithm based on word matching rate. Traditional log matching algorithms compute similarity by one-to-one word matching, whereas LDmatch computes the similarity between two logs from the longest common subsequence (LCS) of the words they contain and classifies logs on that basis. LDmatch can also obtain and update log templates in real time. In addition, the algorithm's pattern warehouse is stored in a hash-table-based data structure, which refines the classification of logs and reduces the number of comparisons during log matching, thereby improving matching efficiency. To verify these advantages, LDmatch is applied to an open-source data set and to an actual log data set generated by CNGrid, and compared with several other log pattern extraction algorithms. The experimental results show that the algorithm has advantages in accuracy, robustness and efficiency.

Key words: log pattern extraction, word matching rate, log template, hash table
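
The abstract describes LCS-based similarity between logs, templates updated in real time, and a hash-table pattern store. The following Python sketch illustrates how those three ideas might fit together; it is not the authors' LDmatch implementation. Several details are assumptions, since the abstract does not give them: the matching rate is taken as LCS length divided by the longer token count, the threshold is 0.5, the hash key is the first word of the log, and the names `PatternStore`, `match_rate`, `merge_template` and the `<*>` wildcard are hypothetical.

```python
from collections import defaultdict

WILDCARD = "<*>"  # hypothetical placeholder for the variable part of a log


def lcs_tokens(a, b):
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def match_rate(a, b):
    """Assumed definition: LCS length over the longer token count."""
    if not a or not b:
        return 0.0
    return len(lcs_tokens(a, b)) / max(len(a), len(b))


def merge_template(template, tokens):
    """Keep the common words and put one wildcard in every gap."""
    merged, i, j = [], 0, 0
    for word in lcs_tokens(template, tokens):
        gap = False
        while template[i] != word:
            i += 1
            gap = True
        while tokens[j] != word:
            j += 1
            gap = True
        if gap:
            merged.append(WILDCARD)
        merged.append(word)
        i += 1
        j += 1
    if i < len(template) or j < len(tokens):
        merged.append(WILDCARD)
    return merged


class PatternStore:
    """Hash table of templates; keying by first word is an assumption."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.buckets = defaultdict(list)

    def add(self, line):
        """Match a log line within its bucket, update and return the template."""
        tokens = line.split()
        if not tokens:
            return None
        bucket = self.buckets[tokens[0]]
        best_idx, best_rate = -1, 0.0
        for idx, template in enumerate(bucket):
            rate = match_rate(template, tokens)
            if rate > best_rate:
                best_idx, best_rate = idx, rate
        if best_idx >= 0 and best_rate >= self.threshold:
            bucket[best_idx] = merge_template(bucket[best_idx], tokens)
            return " ".join(bucket[best_idx])
        bucket.append(tokens)
        return " ".join(tokens)


store = PatternStore()
store.add("Connection from 10.0.0.1 closed")
print(store.add("Connection from 10.0.0.2 closed"))  # Connection from <*> closed
```

Because each incoming line is compared only with the templates in its own bucket, the number of LCS computations per line depends on the bucket size rather than on the total number of templates, which is the kind of saving the abstract attributes to the hash-based pattern warehouse.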

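Abstract (translated from the Chinese version):

Log analysis is vital to the stable operation of computer systems. Logs, however, are usually unstructured, which hinders automated analysis, so automatically extracting log patterns and converting logs into structured data has significant practical value. This paper proposes the LDmatch algorithm, which implements log pattern extraction based on word matching rate. Traditional log matching algorithms use one-to-one word matching when computing similarity, whereas LDmatch computes the similarity between two logs from the longest common subsequence of the words they contain and classifies logs on that basis. LDmatch can also obtain and update log templates in real time. In addition, the algorithm's pattern warehouse is stored in a hash-table-based data structure, which refines log classification and reduces the number of comparisons during log matching, thereby improving the matching efficiency of log pattern extraction. To verify the algorithm's advantages, LDmatch is applied to an open-source data set and to a log data set actually produced by the national high-performance computing environment (CNGrid), and compared experimentally with several other log pattern extraction algorithms. The results show that the algorithm has advantages in accuracy, robustness and efficiency.

Key words: log pattern extraction, word matching rate, log template, hash table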

CLC Number: