# SIMD-0046: Optimistic cluster restart automation
## Summary
During a cluster restart following an outage, make validators enter a separate recovery protocol that uses Gossip to exchange local status and automatically reach consensus on the block to restart from. Proceed to restart if validators in the restart can reach agreement, or print debug information and halt otherwise. To distinguish the new restart process from other operations, we call the new process "Wen restart".
## Motivation
Currently, during a `cluster restart`, validator operators need to decide the highest optimistically confirmed slot, then restart the validators with new command-line arguments. This process involves a lot of human intervention; if someone makes a mistake in deciding the highest optimistically confirmed slot, it is detrimental to the viability of the ecosystem.

We aim to automate the negotiation of the highest optimistically confirmed slot and the distribution of all blocks on that fork, lowering the possibility of human mistakes in the `cluster restart` process. This also reduces the burden on validator operators: they do not have to stay around while the validators automatically try to reach consensus, because a validator will halt and print debug information if anything goes wrong, and operators can set up their own monitoring accordingly.

However, there are many ways an automatic restart can go wrong, mostly due to unforeseen situations or software bugs. To make things safe, we apply multiple checks during the restart; if any check fails, the automatic restart is halted and debugging info is printed, waiting for human intervention. Therefore we call this an optimistic cluster restart procedure.
## Key Changes
- `cluster restart`: When there is an outage such that the whole cluster halts, most validators need to be restarted from a commonly agreed-upon state.
- `cluster restart slot`: In the current `cluster restart` scheme, humans normally decide the highest optimistically confirmed slot and restart the validators from it; we call this slot the `cluster restart slot`.
- `optimistically confirmed block`: A block which gets the votes from the validators holding more than 2/3 of the cluster's stake.
- `wen restart phase`: During the proposed optimistic cluster restart automation, the period in which validators exchange restart status over Gossip and agree on the restart slot; validators in this phase do not participate in normal cluster activities.
- `wen restart shred version`: Right now we update `shred_version` during a `cluster restart` so that messages from before and after the restart are not mixed; the `wen restart phase` similarly uses its own shred version so that its Gossip messages do not mix with normal Gossip traffic.
- `RESTART_STAKE_THRESHOLD`: We need enough validators to participate in a restart before any decision is made; this document requires more than 80% of the stake.
- The operator restarts the validator into the wen restart phase at boot; in this phase the validator does not vote or produce blocks, it only exchanges restart status over Gossip.
- While aggregating local vote information from all others in restart, the validator repairs and replays blocks that could potentially be on the heaviest fork.
- After enough validators are in restart and repair is complete, the validator calculates its own choice of the heaviest fork.
- A coordinator which is configured on everyone's command line sends out its choice of the heaviest fork, and every other validator compares it with its local result.
- Each validator verifies that the coordinator's choice is reasonable:
  - If yes, proceed and restart.
  - If no, print out what it thinks is wrong, halt, and wait for human intervention.
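The proceed-or-halt decision above can be sketched as follows. This is a minimal illustration, not the real implementation: the names are hypothetical, the hash is simplified to a string, and the "reasonable" criterion shown here (agreement with the locally computed heaviest fork) is an assumption, since the exact verification rules are not spelled out in this list.

```rust
// Hypothetical sketch of the final proceed-or-halt decision in wen restart.

/// Result of checking the coordinator's heaviest-fork choice against the
/// fork this validator computed locally.
pub enum CoordinatorCheck {
    /// The coordinator's (slot, hash) matches our local computation.
    Reasonable,
    /// Mismatch: carry a human-readable description for the operator.
    Unreasonable(String),
}

/// Assumed criterion: the coordinator's choice is reasonable when it agrees
/// with the locally computed heaviest fork.
pub fn check_coordinator_choice(
    local_slot: u64,
    local_hash: &str,
    coordinator_slot: u64,
    coordinator_hash: &str,
) -> CoordinatorCheck {
    if local_slot == coordinator_slot && local_hash == coordinator_hash {
        CoordinatorCheck::Reasonable
    } else {
        CoordinatorCheck::Unreasonable(format!(
            "local heaviest fork ({local_slot}, {local_hash}) disagrees with \
             coordinator ({coordinator_slot}, {coordinator_hash})"
        ))
    }
}

/// Returns true when it is safe to proceed with the restart; otherwise the
/// caller prints the debug message and halts for human intervention.
pub fn should_proceed(check: &CoordinatorCheck) -> bool {
    matches!(check, CoordinatorCheck::Reasonable)
}
```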
- Gossip last vote and ancestors on that fork (the `RestartLastVotedForkSlots` message):
  - `last_voted_slot`: `u64`, the last slot the validator voted on; this also serves as the last slot covered by the bit vector below.
  - `last_voted_hash`: `Hash`, the bank hash of `last_voted_slot`.
  - `ancestors`: A run-length-encoding compressed bit vector representing the slots on the sender's last voted fork; the least significant bit always corresponds to `last_voted_slot`, and the most significant bit to `last_voted_slot - 65535`.
- Repair ledgers up to the restart slot.
- Calculate the heaviest fork:
  - Calculate the stake threshold for a block to be on the heaviest fork; the heaviest fork should include every block that could possibly be optimistically confirmed. The threshold is 67% - 5% - stake_on_validators_not_in_restart.
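As a sketch, the `RestartLastVotedForkSlots` payload and the threshold rule above could be modeled like this. The field names follow the list above, but the concrete types are illustrative assumptions: the real message uses the validator's 32-byte `Hash` type and its own run-length-encoded bit vector representation.

```rust
// Illustrative model of the RestartLastVotedForkSlots payload and the
// heaviest-fork stake threshold. Types are simplified for readability.

/// Stand-in for the run-length-encoded bit vector covering the 65536 slots
/// ending at `last_voted_slot` (bit 0 = last_voted_slot,
/// bit 65535 = last_voted_slot - 65535).
pub struct RleBitVec(pub Vec<u8>);

pub struct RestartLastVotedForkSlots {
    /// The last slot this validator voted on; also the last slot covered
    /// by the `ancestors` bit vector.
    pub last_voted_slot: u64,
    /// The bank hash of `last_voted_slot` (a String stands in for the real
    /// 32-byte Hash type).
    pub last_voted_hash: String,
    /// Slots on the sender's last voted fork, RLE-compressed.
    pub ancestors: RleBitVec,
}

/// Stake threshold (as a fraction of total stake) for a block to be
/// considered part of the heaviest fork:
/// 67% (optimistic confirmation) - 5% (tolerated non-conforming stake)
/// - the stake of validators not participating in the restart.
pub fn heaviest_fork_threshold(stake_not_in_restart: f64) -> f64 {
    0.67 - 0.05 - stake_not_in_restart
}
```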
## Impact

This proposal adds a new `wen restart` mode to validators; in this mode the validators will not participate in normal cluster activities. Compared to today's `cluster restart`, the new mode may require more network bandwidth and memory on the restarting validators, but it guarantees the safety of optimistically confirmed user transactions, and validator operators don't need to manually generate and download snapshots during a `cluster restart`.
## Backwards Compatibility

This change is backward compatible with previous versions, because validators only enter the new mode when restarted with a new command-line argument. All current restart arguments like `--wait-for-supermajority` and `--expected-bank-hash` will be kept as is.
## Security Considerations

The two added Gossip messages `RestartLastVotedForkSlots` and `RestartHeaviestFork` will only be sent and processed when the validator is restarted in `wen restart` mode, so a random validator restarting in the new mode will not clutter the Gossip CRDS table of a normal system. Non-conforming validators could send out wrong `RestartLastVotedForkSlots` messages to mess with `cluster restart`s; these should be included in the slashing rules in the future.

### Handling oscillating votes

Non-conforming validators could change their last votes back and forth, which could destabilize the system. We therefore forbid any change of slot or hash in `RestartLastVotedForkSlots` or `RestartHeaviestFork`: everyone sticks with the first value received, and discrepancies are recorded in the proto file for later slashing.

### Handling multiple epochs

Even though it is not very common for an outage to span an epoch boundary, we need to prepare for this rare case. Because the main purpose of `wen restart` is to make everyone reach agreement, the following choices are made:

* Every validator only handles 2 epochs; it discards slots belonging to an epoch that is more than 1 epoch away from its root. If a validator's root is so old that it cannot proceed, it exits and reports an error. Since we assume an outage will be discovered within 7 hours and one epoch is roughly two days, handling 2 epochs should be enough.
* The stake weight of each slot is calculated using the epoch the slot is in. Because epoch stakes are currently calculated 1 epoch ahead of time, and we only handle 2 epochs, the local root bank should have the epoch stakes for all epochs we need.
* When aggregating `RestartLastVotedForkSlots`, for any epoch where the validators voting for some slot in that epoch hold at least 33% of the epoch's stake, calculate the stake of validators active in the restart in that epoch. Only exit this stage if every epoch reaching the 33% bar has more than 80% of its stake active in the restart.
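The multi-epoch exit condition above can be sketched as a small check. This is a minimal sketch under stated assumptions: the type and function names are hypothetical, and stakes are modeled as fractions of each epoch's total stake.

```rust
// Hypothetical sketch of the per-epoch exit condition for the
// RestartLastVotedForkSlots aggregation stage.
use std::collections::HashMap;

/// Per-epoch stake fractions observed during aggregation.
pub struct EpochStake {
    /// Fraction of the epoch's stake held by validators that voted for
    /// some slot in this epoch.
    pub voted_stake: f64,
    /// Fraction of the epoch's stake held by validators active in restart.
    pub active_stake: f64,
}

/// An epoch "reaches the bar" when validators voting for a slot in it hold
/// at least 33% of that epoch's stake; we may only leave the aggregation
/// stage when every such epoch has more than 80% of its stake in restart.
pub fn can_exit_aggregation(epochs: &HashMap<u64, EpochStake>) -> bool {
    epochs
        .values()
        .filter(|e| e.voted_stake >= 0.33)
        .all(|e| e.active_stake > 0.80)
}
```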
This is a bit restrictive, but it guarantees that whichever slot we select for the heaviest fork, we have enough validators in the restart. Note that the epoch containing the local root should always be considered, because the root should have more than 33% stake. We now prove this is safe: whenever a slot is optimistically confirmed in the new epoch, we only exit the `RestartLastVotedForkSlots` aggregation stage if more than 80% of the new epoch's stake has joined:

1. Assume slot `X` is optimistically confirmed in the new epoch, so it has more than 67% of the new epoch's stake.
2. Our stake warmup/cooldown limit is currently 9%, so at least 67% - 9% = 58% of that stake was also staked in the old epoch.
3. We always have more than 80% of the old epoch's stake, so at least 58% - 20% = 38% of that stake is in the restart. Excluding non-conforming stake, at least 38% - 5% = 33% is conforming and in the restart, and it will report having voted for `X`, which is in the new epoch.
4. That 33% reaches the bar above, so we will require more than 80% stake in the new epoch as well.
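The arithmetic in this proof can be replayed mechanically. The sketch below just restates the worst-case bound with the constants from the steps above; the function name is illustrative.

```rust
// Replay of the worst-case bound from the safety proof: start from the 67%
// optimistic-confirmation bar, then subtract the 9% stake warmup/cooldown
// limit, the up-to-20% of old-epoch stake outside the restart, and the 5%
// tolerated non-conforming stake.

/// Worst-case fraction of new-epoch stake guaranteed to be conforming,
/// in the restart, and reporting a vote for the optimistically confirmed
/// slot `X`. This must reach the 33% bar that triggers the >80% per-epoch
/// requirement.
pub fn min_conforming_new_epoch_stake() -> f64 {
    0.67 - 0.09 - 0.20 - 0.05
}
```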