Automated application-level checkpointing of MPI programs

2020-02-18 14:06

C3: A System for Automating Application-level Checkpointing of MPI Programs
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
Department of Computer Science, Cornell University, Ithaca, NY

Keywords: MPI program, application-level checkpointing, MPI library state, fault-tolerance protocol, stopping failure model, software complexity, computational science application, instrumenting MPI programs, faulty process hang, high-performance computing platform, experimental results, running time, suitable protocol

Automated application-level checkpointing of MPI programs. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state.
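To make the idea of saving application state concrete, here is a minimal single-process sketch in Python. It is not the protocol or the precompiler output described in the paper: it involves no MPI and no library state, and every name in it (`save_checkpoint`, `load_checkpoint`, `run`, the `fail_at` hook) is hypothetical. It only illustrates the general application-level pattern, in which the program itself decides what state to save and resumes from the last saved state after a stopping failure.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=5, fail_at=None):
    """Sum the squares of 0..n_steps-1, checkpointing every few steps.

    fail_at is a hypothetical hook that simulates a stopping failure at that step.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state["total"] += step * step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same checkpoint path after a failure resumes from the last saved step instead of from step 0. In an MPI setting, the processes would additionally have to agree on a consistent set of checkpoints, which is the part a coordination protocol addresses and which this sketch leaves out.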

Automated application-level checkpointing of MPI programs. By Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill. In this paper, we present a suitable protocol, and show how it can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation.

Keywords: application-level coordinated non-blocking checkpointing

... C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.

1 Introduction ... has been studied extensively in the context of distributed systems [6.

Automated Application-level Checkpointing of MPI Programs. Abstract. Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure

Application-Level Fault Tolerance for MPI Programs. Keshav Pingali. 2 The Problem: Application-Level, System-Level. 5 Solution Space Detail: Checkpointing, Blocking Coordinated Checkpointing. Many programs are bulk-synchronous programs (BSP model: Valiant).

C3: A System for Automating Application-level Checkpointing of MPI Programs. Abstract. Fault tolerance is becoming necessary on high-performance platforms.
