Journal of Frontiers of Computer Science and Technology, 2007, Vol. 1, Issue (2): 191-199.

• Academic Research •

A new hybrid mechanism for Checkpoint/Restart in OpenMP programs

HUANG Chun+, LIU Yongpeng, YANG Xuejun

  1. School of Computer, National University of Defense Technology, Changsha 410073, China
  • Online: 2007-08-20 Published: 2007-08-20
  • Contact: HUANG Chun

Abstract: Checkpoint/Restart is an important approach to software fault tolerance. This paper presents a hybrid Checkpoint/Restart mechanism for OpenMP programs that coordinates system-level and application-level support. The system-level support provides transparency and lets all threads save the shared data together. An application-level OpenMP checkpoint protocol separates the semantics-related operations of OpenMP from the low-level system and makes them independent of it, which improves the portability of the checkpoint system. Based on this mechanism, the CCRG OpenMP Checkpoint/Restart system has been implemented. Experiments, including NPB3.2-OMP, show that the overhead of checkpointing and restarting is low enough for the system to be used with large-scale programs.

Key words: OpenMP, Checkpoint/Restart, system-level and application-level coordination

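Abstract (translated from the Chinese): Checkpoint/restart is one of the important approaches to software fault tolerance. This paper describes a hybrid system-level and application-level checkpoint mechanism for OpenMP. System-level support not only gives the checkpoint system good transparency, but also means that shared data is no longer saved by the master thread alone, which yields good data locality. The application-level OpenMP protocol isolates the OpenMP-related protocol handling, which improves the portability of the system. NPB3.2-OMP test results show that the time overhead of checkpointing and restarting is small and can meet the practical needs of large-scale programs.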
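
The idea that shared data is saved by all threads together, rather than by the master thread alone, can be illustrated with a minimal sketch. It is not the CCRG implementation: the function names `checkpoint_shared` and `restore_shared`, the per-thread file naming, and the three-word file header are all assumptions made for illustration. The sketch covers only application-visible array data and makes no attempt at system-level state capture.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helpers (not part of any real checkpoint library):
 * fall back to a single thread when compiled without OpenMP. */
static size_t thread_id(void) {
#ifdef _OPENMP
    return (size_t)omp_get_thread_num();
#else
    return 0;
#endif
}

static size_t thread_count(void) {
#ifdef _OPENMP
    return (size_t)omp_get_num_threads();
#else
    return 1;
#endif
}

/* Every thread writes its own contiguous slice of the shared array to
 * "<prefix>.<tid>". Header: slice start, slice end, thread count.
 * Returns 0 on success, -1 if any thread failed. */
int checkpoint_shared(const double *data, size_t n, const char *prefix) {
    int failed = 0;
    assert(data != NULL && prefix != NULL);
    #pragma omp parallel reduction(|:failed)
    {
        size_t tid = thread_id(), nt = thread_count();
        size_t hdr[3] = { n * tid / nt, n * (tid + 1) / nt, nt };
        size_t len = hdr[1] - hdr[0];
        char path[256];
        FILE *f;
        snprintf(path, sizeof path, "%s.%zu", prefix, tid);
        f = fopen(path, "wb");
        if (f == NULL) {
            failed = 1;
        } else {
            if (fwrite(hdr, sizeof hdr[0], 3, f) != 3 ||
                fwrite(data + hdr[0], sizeof *data, len, f) != len)
                failed = 1;
            if (fclose(f) != 0)
                failed = 1;
        }
    }   /* implicit barrier: all slices are written before returning */
    return failed ? -1 : 0;
}

/* Serial restore: reads the slices back, so it does not depend on the
 * thread count available at restart. Returns 0 on success, -1 on error. */
int restore_shared(double *data, size_t n, const char *prefix) {
    size_t nt = 1;  /* replaced by the count stored in each header */
    assert(data != NULL && prefix != NULL);
    for (size_t tid = 0; tid < nt; ++tid) {
        size_t hdr[3];
        char path[256];
        FILE *f;
        int ok;
        snprintf(path, sizeof path, "%s.%zu", prefix, tid);
        f = fopen(path, "rb");
        if (f == NULL)
            return -1;
        ok = fread(hdr, sizeof hdr[0], 3, f) == 3 &&
             hdr[0] <= hdr[1] && hdr[1] <= n &&
             fread(data + hdr[0], sizeof *data, hdr[1] - hdr[0], f)
                 == hdr[1] - hdr[0];
        fclose(f);
        if (!ok)
            return -1;
        nt = hdr[2];
    }
    return 0;
}
```

Built with an OpenMP-enabled compiler (for example `gcc -fopenmp`), each thread writes its own slice in parallel; without OpenMP the pragma is ignored and a single slice is written. Keeping the restore serial is a simplification chosen here so that the restart does not have to run with the same number of threads as the checkpoint.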
