A system can fail in the middle of a long and expensive computation. Instead of starting from scratch, there are two complementary techniques we can use:

  1. Checkpointing: the system periodically saves its current state on disk, and nodes can go back to the checkpoint in case of crash.
  2. Logging: messages and nondeterministic events are logged, and after rolling back to a checkpoint, these logs are replayed to bring the node back to its last valid state.

Centralized Checkpointing

Let a โ€œcutโ€ of distributed execution be a prefix of each nodeโ€™s local execution. The cut is โ€œconsistentโ€ if for every event it contains, it also contains all events that happened before itโ€”this means that every received message must have been sent.

To roll back after a crash, we need to find checkpoints that form a consistent cutโ€”called a recovery line. If the most recent checkpoints arenโ€™t consistent, we need to go further back. However, if nodes take checkpoints independently, it is very unlikely that any cut will be consistent; in other words, we need the nodes to coordinate their checkpointing.

Centralized checkpointing provides a solution similar to Two-Phase Commit (2PC):

  1. Coordinator sends โ€œcheckpointโ€ to all processes.
  2. When a process receives the message, it takes a checkpoint, queues ongoing messages (to avoid inconsistency), and responds with โ€œack.โ€
  3. Once all โ€œacksโ€ are received, the coordinates sends โ€œdone,โ€ allowing queued messages to be sent.

While this solution works, itโ€™s not ideal since it needs a central coordinator and queued messages; it also loses in-flight messages, ones that were sent before the senderโ€™s checkpoint but received after the recipientโ€™s checkpoint.

Chandy-Lamport

What we really want is a โ€œsnapshotโ€ of the system containing both checkpoints and in-flight messages. We can do so with the Chandy-Lamport algorithm, which is also fully distributed and does not need to queue messages:

  1. Let every node have direct, reliable FIFO channels to other nodes .
  2. When wants to initiate a snapshot:
    1. It takes a local checkpoint.
    2. It sends a special marker message to each outgoing channel.
    3. It begins recording messages that arrive on incoming channels.
  3. When node receives the marker from on channel :
    1. If it hasnโ€™t taken a checkpoint, it takes the checkpoint, sends a marker to each outgoing channel, records the state of as empty set, and begins recording messages on all other incoming channels .
    2. If it already took a checkpoint, it stops recording messages for .
  4. Once receives a marker from every other node, the algorithm terminates.

With this, the recorded messages are the missing in-flight ones at snapshot time.