Using checkpoint/restart
Under construction - preliminary, incomplete, and untested - do not use yet.
Introduction
The term "checkpointing" refers to saving the state of an executing process to disk so that the process can be restarted from that point at a later time. This is critical for long-running jobs whose CPU time requirements exceed the limits of the Palmetto queue configurations. This document describes the steps you must take to:
- determine if your code is checkpointable,
- determine where algorithmically to checkpoint your code,
- modify your source code,
- construct a PBS script for checkpointing,
- submit your code to PBS for execution.
Determining if Code is Checkpointable
Some jobs are too complicated to checkpoint because the state of some processes cannot be recreated. However, long-running, computationally intensive jobs are generally well suited for checkpointing.
Another limitation on the checkpointability of a job is file I/O. Jobs that read exclusively from one set of files and write exclusively to another set of files are generally checkpointable. Jobs that read from a file, write to that file, and then read from it again are not checkpointable. In addition, the following types of system calls are known to cause problems with checkpointing:
- Administrative
- Signals
- Forks
- Dynamic loading
- Shared memory
- Semaphore
- Messages
- Internal timers
- Set user/group id
You should avoid these calls in the code that you want to checkpoint.
Determining Where Algorithmically to Checkpoint
Ideally, you should insert checkpointing instructions into your code at locations that are executed infrequently but consistently, because checkpointing carries a significant amount of overhead. If a program is checkpointed too often, the checkpointing itself will severely hinder the performance of the program. If a program is not checkpointed often enough, there is a risk of losing significant computational results. An example of an ideal checkpointing location is shown in the following sample code section.
      program dot
      do 10 i = 1, 100000
         call bigsub(i,y)
         < CHECKPOINTING INSTRUCTIONS >
 10   continue
      print *, y
      end
If the routine bigsub took 45 minutes to execute, this code would checkpoint every 45 minutes. You could also set up the checkpointing instructions to checkpoint only every 90 minutes, 135 minutes, and so on. For example:
      program dot
      do 10 i = 1, 100000
         call bigsub(i,y)
         if (mod(i,2).eq.0) then
            < CHECKPOINTING DIRECTIVES >
         endif
 10   continue
      print *, y
      end
In this case the program would checkpoint every 90 minutes. Changing the second argument of mod changes the checkpointing interval.
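The placeholder in the samples above stands for the checkpointing instructions themselves, which this page does not spell out. As a rough illustration of the same loop pattern, the sketch below uses plain application-level checkpointing instead: every nskip iterations it writes the loop index and y to a file, and at startup it resumes from that file if one exists. The file name dot.ckpt, the names nskip, nmax, and istart, and the body of bigsub are all hypothetical, and this is not the checkpoint facility that the rest of this document refers to.

```fortran
      program dot
c     Sketch only: application-level checkpoint/restart.
c     dot.ckpt, nskip, nmax and the body of bigsub are hypothetical.
      implicit none
      integer i, istart, nskip, nmax
      double precision y
      logical found
      parameter (nskip = 2, nmax = 10)
      istart = 1
      y = 0.0d0
c     Resume after the last saved iteration if a checkpoint exists.
      inquire (file='dot.ckpt', exist=found)
      if (found) then
         open (unit=11, file='dot.ckpt', status='old')
         read (11,*) istart, y
         close (11)
         istart = istart + 1
      endif
      do 10 i = istart, nmax
         call bigsub(i, y)
c        Save the loop index and state every nskip iterations.
         if (mod(i,nskip).eq.0) then
            open (unit=11, file='dot.ckpt', status='unknown')
            write (11,*) i, y
            close (11)
         endif
 10   continue
      print *, y
      end

      subroutine bigsub(i, y)
c     Stand-in for the long-running routine: accumulates i into y.
      implicit none
      integer i
      double precision y
      y = y + dble(i)
      return
      end
```

Delete dot.ckpt to start again from the first iteration. A more careful version would write the checkpoint to a temporary file and rename it afterwards, so that an interruption in the middle of a write cannot damage the last good checkpoint.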
More detailed example
PBS scripting and checkpoint/restart
