Checkpointing the Un-checkpointable: the Split-Process Approach for MPI and Formal Verification

Published on 30 Dec 2019, 18:59
Checkpointing is the ability to save the state of a running process to stable storage, and later restart, perhaps on a different computer. Transparent checkpointing (or system-level checkpointing) is the ability to checkpoint a (possibly parallel or distributed) application, without modifying the application binaries. The speaker has led the DMTCP project in this area for 15 years.

This talk presents an efficient, new software architecture: split processes. It is being applied to MPI ("MANA for MPI") and formal verification. The key insight is that any numerical, network and system libraries (e.g., including MPI and CUDA) reside in process memory that we tag as the "lower half". Only application-specific code is checkpointed (text, data and stack). On restart, a separate lower-half process is started as a trivial application. A bit of plumbing then re-connects the upper half with this lower-half, and unites them in a single thread of control. Runtime overhead remains near zero, since the lower-half libraries on the destination host have been optimized for that host. Overhead is near zero even for heterogeneous systems with wildly different libraries (e.g., TCP on source host and InfiniBand on destination).

A test of any new theory is how broadly it can be applied. The split-process approach is related in spirit to Microsoft Drawbridge and to the design of WSL. But its internals are different. This difference enables ongoing work in exploring split processes to provide the first native checkpointing support for Microsoft Windows that doesn't require any administrative privilege.

The new theory is also being applied to formal verification, in a collaboration with Martin Quinson and the SimGrid team. SimGrid provides model checking for Pthread, MPI, and distributed applications, based directly on the original source code (no intermediate model necessary). SimGrid successfully finds race conditions and safety property violations in real-world software, but it has largely been limited to the first few minutes of execution. However, as one scales up, a rare race condition may eventually trigger, resulting in a new crash beyond the reach of SimGrid. To capture this case, applications are started natively (without SimGrid), while periodic checkpointing under DMTCP. After a crash, the application is restarted from the last checkpoint, but using a SimGrid-based lower half. SimGrid then produces a schedule that exhibits the crash, regardless of any race conditions.

The split-process approach represents joint work with Rohan Garg.

Talk slides available:

See more on this and other talks at Microsoft Research: