Skip to content

Conversation

@wen-coding
Copy link
Contributor

Broadcast BLS cert all-to-all.

@wen-coding wen-coding marked this pull request as draft July 20, 2025 00:18
@wen-coding wen-coding marked this pull request as ready for review July 20, 2025 01:33
@wen-coding wen-coding mentioned this pull request Jul 20, 2025
Copy link
Contributor

@AshwinSekar AshwinSekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great! just a couple of comments

Copy link
Contributor

@AshwinSekar AshwinSekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @carllin any other comments?

Copy link
Contributor

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Left a few comments

Self { t_ingest }
}

fn maybe_update_root_and_send_new_certificates(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this go in root_utils.rs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the sending certificates part should probably stay here.
We could move this part to root_utils.rs.

  // Reset standstill timer
            debug_assert!(new_finalized_slot > *highest_finalized_slot);
            *highest_finalized_slot = new_finalized_slot;
            *standstill_timer = Instant::now();
            // Set root
            let root_bank = root_bank_cache.root_bank();
            if root_bank.slot() > *current_root {
                *current_root = root_bank.slot();
                cert_pool.handle_new_root(root_bank);
            }

Maybe cleaner to create a struct with all the mutables from the loop to pass around instead of all the arguments as well:

&mut current_root,
&mut highest_finalized_slot,
&mut standstill_timer,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I think the standstill_timer maybe doesn't belong here either....

Since this is code refactoring, can this change maybe go into another PR instead of piggying back on this BLS cert all to all change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay w/ doing it in a follow-up 👍

}
// Send new certificates to peers
for certificate in new_certificates_to_send {
// bls_sender is an unbounded channel, so error only happens if the receiver is disconnected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we bound this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is verified and forwarded by our own certificate pool before getting here, and we should only send once per (slot, cert_type), I think it's okay if we don't bound the channel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we bound, we can pre-allocate a ring buffer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm okay, that's fair. Although I probably won't build certificate resend here, I'll just drop and log error, since in normal cases the buffer shouldn't overflow.

The BLSOps to send out should be biggest during standstill handling. Currently we are saying after 10 seconds standstill, send out all own votes and certificates, since 10/0.4 = 25, and we can have <= 5 votes and <= 5 certs per slot, I'll cap the channel at 512.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

512 sounds reasonable to me. Might be nice to log a drop metric as well (easier to query at scale and see if we have a bug)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Err currently we have no stats in certificate_pool_service at all. Can I do that in a follow up PR to properly add stats for drops and everything else?

let (bls_sender, bls_receiver) = unbounded();
// The BLS sender channel should be mostly used during standstill handling,
// there could be 10s/400ms = 25 slots, <=5 votes and <=5 certificates per slot,
// we cap the channel at 512 to give some headroom.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the comment!

Copy link
Contributor

@qkniep qkniep Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 25 slots is not an upper bound though, only what we would expect if the network recovered in the meantime (if e.g. we have a network partition for >10s this does not hold)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the network is in a partition where everyone is in a partition with <60% stake, then we shouldn't have many votes and certs constructed during the partition? If one partition is > 60% stake, then others should be able to observe new finalized cert and continue during recovery.

@wen-coding wen-coding merged commit db832de into anza-xyz:master Jul 22, 2025
7 checks passed
@wen-coding wen-coding deleted the bls_cert_all_to_all_2 branch July 22, 2025 16:23
@wen-coding wen-coding linked an issue Jul 25, 2025 that may be closed by this pull request
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 1, 2025
bw-solana pushed a commit to bw-solana/alpenglow that referenced this pull request Aug 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Broadcast new certificates in certificate_service to all-to-all

4 participants