Problem
I am running a primary/secondary validator setup with two physically separate validators. Both validators were running 1.16.16-jito w/local vote mods (the vote mods are not believed to alter any aspect of accounts state, because the mods only alter which slots are voted on in replay_stage.rs).
Both primary and secondary crashed simultaneously with the same error message. This is two separate validators experiencing the same issue. Note that the secondary likely took the tower from the primary shortly before the primary crashed as is common behavior when a secondary's tower gets out of sync with the primary.
The log message on both validators was:
[2023-10-16T22:47:33.894214958Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at 'We have tried to repair duplicate slot: 224116920 more than 10 times and are unable to freeze a block with bankhash 38rFNeVT1FGHResoXsgTN2cyyjBiJd6ZFcXWCPnXzZn7, instead we have a block with bankhash Some(GVZoNMHn1o8CB76i3LLRHjTA1p8f4iyTa6hn9eopWJh1). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', core/src/replay_stage.rs:1585:25" location="core/src/replay_stage.rs:1585:25" version="1.16.16 (src:00000000; feat:4033350765, client:JitoLabs)"
I have collected the following files from one of the validators:
snapshot:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/snapshot-224107039-GTo6rcFciwNew8Bx9Hggv7n9oS5AZVeQGSXQMkaFfkfy.tar.zst
incremental snapshot:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/incremental-snapshot-224107039-224116213-qg6B13UGAAMKDm6XqG6p6PchnZ7RAgjhbe759igiFce.tar.zst
ledger (copied 10,000 slots leading up to the crash):
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/ledger.tar.gz
validator logs leading up to the crash:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/validator.log.gz