Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (8): 945-955. DOI: 10.3778/j.issn.1673-9418.1402030

Automatic Diagnosis and Problem Management Approach for Online System

YANG Xinsheng1+, LI Hong2, WANG Wei1, HUANG Xiang3, WEI Jun1   

  1. Technology Center of Software Engineering, Institute of Software, University of Chinese Academy of Sciences, Beijing 100190, China
  2. Publishing House of Electronics Industry, Beijing 100036, China
  3. School of Information Science and Technology, Sun Yat-Sen University, Guangzhou 510006, China
  • Online: 2014-08-01 Published: 2014-08-07

Automated Fault Diagnosis and Error Management Method for Online Service Systems

杨鑫晟1+,李  弘2,王  伟1,黄  翔3,魏  峻1   

  1. 中国科学院大学 软件研究所 软件工程研究中心,北京 100190
  2. 电子工业出版社,北京 100036
  3. 中山大学 信息科学与技术学院,广州 510006

Abstract: An online system must serve massive numbers of user requests in real time, and a wide range of issues can cause requests to fail while the system is running. Because of the large volume of requests and logs, managing failed requests manually is highly time-consuming. This paper proposes a clustering-based problem management approach that mines the logs generated by the system and classifies the failed requests. With this approach, engineers and administrators can get an overview of the issues in the online system and learn the distribution and trend of failed requests. The approach also identifies code defects automatically: it uses a voting mechanism to narrow the cause of each type of issue down to a small range of the source code, which reduces the cost of debugging. A problem management tool based on the approach was implemented and deployed on a real enterprise online system. The experimental results show that the approach manages failed requests efficiently and reduces the cost of maintenance and debugging. The approach can also be combined with MapReduce to handle large-scale logs.

Key words: problem management, log analysis, clustering

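Abstract (from the Chinese original): An online service system must accept and process a large number of user requests in real time, and errors of many kinds can cause user requests to fail. Because such a system handles many requests and produces a large volume of log data, analyzing, counting, and managing the causes of failed requests by traditional manual methods involves an enormous workload. This paper proposes an automated fault diagnosis and error management method based on a clustering algorithm. By automatically analyzing the logs that the online service system itself produces, the method classifies users' failed requests, helps developers and system administrators manage errors, and reveals the categories, distribution, and trends of failed requests. In addition, a voting mechanism automatically maps each class of error to its location in the source code, so that software defects and problems can be located quickly. An automated error management tool was implemented on the basis of this method and applied to an enterprise-level online service system. Experimental results show that the method manages system faults effectively, reduces the cost of system maintenance and debugging, and can be combined with MapReduce to process massive volumes of log data.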

Key words (from the Chinese original): error management, log analysis, clustering
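
The abstract names two steps, clustering failed requests from their logs and voting to narrow each cluster down to a source location, without giving the algorithms. The Python sketch below is only an illustration of that general idea and is not the authors' method. Everything in it is an assumption made for the example: the log line format `<request_id> <file>:<line> <message>`, the Jaccard similarity measure, the greedy single-pass clustering, the 0.6 threshold, the one-vote-per-location rule, and the sample log lines.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<request_id> <file>:<line> <message>"
LOG_LINE = re.compile(r"^(?P<req>\S+) (?P<loc>\S+:\d+) (?P<msg>.*)$")

def parse(lines):
    """Group log records by request id: {req: [(location, message), ...]}."""
    requests = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            requests.setdefault(m["req"], []).append((m["loc"], m["msg"]))
    return requests

def tokens(records):
    """Token set for one request; digits are masked so that varying
    numbers (timeouts, user ids) do not split otherwise equal messages."""
    out = set()
    for _, msg in records:
        out.update(re.sub(r"\d+", "#", msg).lower().split())
    return out

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(requests, threshold=0.6):
    """Greedy single-pass clustering: a request joins the first cluster
    whose representative (its first member) is similar enough,
    otherwise it starts a new cluster."""
    clusters = []  # list of (representative token set, [request ids])
    for req, records in requests.items():
        t = tokens(records)
        for rep, members in clusters:
            if jaccard(rep, t) >= threshold:
                members.append(req)
                break
        else:
            clusters.append((t, [req]))
    return [members for _, members in clusters]

def vote(requests, members, top=1):
    """Each request in the cluster casts one vote per distinct source
    location it logged; ties are broken alphabetically by location."""
    ballots = Counter()
    for req in members:
        ballots.update({loc for loc, _ in requests[req]})
    return sorted(ballots.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Made-up sample data for illustration only.
SAMPLE_LOGS = [
    "r1 db.py:42 connection timeout after 30 s",
    "r1 pool.py:8 retry 1 exhausted",
    "r2 db.py:42 connection timeout after 45 s",
    "r3 auth.py:7 invalid token for user 1001",
]

if __name__ == "__main__":
    reqs = parse(SAMPLE_LOGS)
    for members in cluster(reqs):
        print(members, vote(reqs, members))
```

On this toy input the two timeout requests land in one cluster, and `db.py:42` collects the most votes for it. Because each request is handled independently until the grouping step, a sketch like this could plausibly be split into map (parse and tokenize per request) and reduce (group and vote) phases, which is in the spirit of the MapReduce combination the abstract mentions.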