A Brief Look at Checkpoint / Snapshot

Fault Tolerance …

Fault tolerance is an important element of distributed systems. When a system spans several machines, how do we recover if any one part fails? This post introduces two approaches: checkpoint and snapshot.

Types of failures …

1. Crash failures
The machine stops responding. In this case it can recover from its last saved state.

2. Fail stop
The machine behaves as in a crash failure, but it loses all of its state and cannot resume from the last saved state.

(Figure taken from the internet)

Checkpoint …

1. Checkpoint the process state periodically.

2. After a restart, the machine can still recover from the last saved checkpoint. (The checkpoint does not disappear when the machine restarts.)
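
The two points above can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: the state is a small JSON-serializable dict, the checkpoint lives in a local file, and the names `save_checkpoint`, `load_checkpoint` and `run` are hypothetical. The file is written to a temporary name and then renamed, so a crash in the middle of a save does not damage the previous checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)  # the old checkpoint stays valid until this point


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, every, crash_at=None):
    """Add up 0..total_steps-1, checkpointing after every `every` steps."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state
```

If `run` crashes partway through, calling it again with the same path resumes from the last checkpoint instead of from step 0. The work done after that checkpoint is lost and has to be repeated.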

Challenge of checkpoint …

Processes exchange messages and confirm them with each other, so their checkpoints depend on each other. If one process fails and has to roll back, the processes that depend on it may have to roll back too, and in the worst case every process ends up back at its initial state. This situation is called the domino effect.

(Figure taken from the internet)
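
To make the domino effect concrete, here is a small simulation sketch. It assumes things the post does not state: every event has a global timestamp (a convenience of the simulation that a real distributed system does not have), every process has a checkpoint at time 0 for its initial state, and the function name `recovery_line` is hypothetical. A message whose send has been undone by a rollback but whose receive is still part of the receiver's state forces the receiver to roll back as well.

```python
import math


def recovery_line(checkpoints, messages, failed):
    """Return, per process, the checkpoint time it must roll back to.

    checkpoints: process -> list of checkpoint times (must include 0)
    messages:    list of (sender, sent_at, receiver, received_at)
    failed:      name of the process that crashed
    math.inf in the result means "no rollback needed".
    """
    restore = {p: math.inf for p in checkpoints}
    restore[failed] = max(checkpoints[failed])  # failed process uses its latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            send_undone = sent_at > restore[sender]
            receive_kept = received_at <= restore[receiver]
            if send_undone and receive_kept:
                # The receiver remembers a message that was "never sent":
                # roll it back to its latest checkpoint before the receive.
                restore[receiver] = max(
                    t for t in checkpoints[receiver] if t < received_at
                )
                changed = True
    return restore


checkpoints = {"P": [0, 60], "Q": [0, 40]}
messages = [("P", 30, "Q", 35), ("Q", 50, "P", 55), ("P", 70, "Q", 75)]
print(recovery_line(checkpoints, messages, failed="P"))  # {'P': 0, 'Q': 0}
```

In this example P only needs to go back to its checkpoint at 60, but the messages between the checkpoints drag Q back to 40, then P back to 0, then Q back to 0.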

Chandy-Lamport algorithm …

The Chandy-Lamport algorithm is used in distributed systems. It captures a consistent global state of the system.

Global snapshot …

A global snapshot captures state that respects the "happened before" relation.
For example:
If a -> b and the snapshot includes b, then a must also be in the snapshot. (as in the figure)
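
The rule above can be checked mechanically. The sketch below is my own illustration, not part of the algorithm: a snapshot is described by how many events of each process it includes, and a message is a send event on one process followed by a receive event on another. The name `is_consistent_cut` and the data layout are hypothetical.

```python
def is_consistent_cut(cut, messages):
    """Check that no message is received inside the cut but sent outside it.

    cut:      process -> number of its events included in the snapshot
    messages: list of (sender, send_index, receiver, recv_index), where the
              indexes are 1-based positions in each process's own event order
    """
    for sender, send_index, receiver, recv_index in messages:
        received_in_cut = recv_index <= cut[receiver]
        sent_in_cut = send_index <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False  # b is in the snapshot but a is not
    return True


# P's 2nd event sends a message that Q receives as its 1st event.
messages = [("P", 2, "Q", 1)]
print(is_consistent_cut({"P": 1, "Q": 1}, messages))  # False
print(is_consistent_cut({"P": 2, "Q": 1}, messages))  # True
print(is_consistent_cut({"P": 2, "Q": 0}, messages))  # True
```

The last case is a message that is still in flight: it was sent but not yet received. That is allowed, and it is why the snapshot also has to record the state of the channels.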

Process of global snapshot …

The sender …

1. One of the processes (the sender) records its own local state.

2. The sender broadcasts a marker message to the other processes. The marker tells them that a snapshot is needed and marks the point up to which messages are recorded.

3. The sender starts recording the messages that arrive on its incoming channels.

The receiver …

1. First case: the receiving process has not yet recorded its state when the marker message arrives
a. Record its local state.
b. Record the state of the channel from the sender to the receiver as empty.
c. Send marker messages to the other processes.

2. Second case: the receiving process has already recorded its state
a. Stop recording on the channel the marker arrived on.
b. Record the state of the channel from the sender to the receiver as the messages received on it since the local state was recorded.
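
The sender and receiver steps above can be tried out in a single-threaded simulation. This is a sketch under assumptions: channels are reliable and FIFO (the algorithm relies on this), each process's "state" is just a bank balance, and the `Process` class, `drain` and `demo` are names made up for the example.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance, peers, network):
        self.name = name
        self.balance = balance
        self.peers = peers          # names of all other processes
        self.network = network      # (src, dst) -> deque used as a FIFO channel
        self.recorded_state = None  # local state saved for the snapshot
        self.channel_state = {}     # sender -> messages recorded on that channel
        self.recording = set()      # senders whose channel is still being recorded

    def send(self, dst, amount):
        self.balance -= amount
        self.network[(self.name, dst)].append(amount)

    def start_snapshot(self):
        self.recorded_state = self.balance           # record local state
        self.channel_state = {p: [] for p in self.peers}
        self.recording = set(self.peers)             # record incoming channels
        for p in self.peers:                         # broadcast the marker
            self.network[(self.name, p)].append(MARKER)

    def receive(self, src):
        msg = self.network[(src, self.name)].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.start_snapshot()    # first case: record state, send markers
            self.recording.discard(src)  # channel from src is now final
        else:
            self.balance += msg
            if src in self.recording:    # arrived after our state was recorded
                self.channel_state[src].append(msg)


def drain(processes, network):
    """Deliver messages until every channel is empty."""
    progress = True
    while progress:
        progress = False
        for (src, dst), channel in network.items():
            if channel:
                processes[dst].receive(src)
                progress = True


def demo():
    names = ["P", "Q", "R"]
    network = {(a, b): deque() for a in names for b in names if a != b}
    start = {"P": 100, "Q": 50, "R": 25}
    processes = {
        n: Process(n, start[n], [m for m in names if m != n], network)
        for n in names
    }
    processes["Q"].send("P", 10)     # this message is in flight during the snapshot
    processes["P"].start_snapshot()
    drain(processes, network)
    return processes


for p in demo().values():
    print(p.name, p.recorded_state, p.channel_state)
```

In this run P records a balance of 100, Q records 40, R records 25, and the channel from Q to P is recorded as holding the message 10. The recorded total is 175, the same as the total before the snapshot, so no money is lost or counted twice.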

-MsHe
