
Conversation

cody-littley
Contributor

Why are these changes needed?

Fixes the way StoreChunks() uses a semaphore; the old pattern never released the semaphore.

@cody-littley cody-littley requested a review from litt3 August 12, 2025 15:59

github-actions bot commented Aug 12, 2025

The latest Buf updates on your PR. Results from workflow Buf Proto / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|-------|--------|------|----------|---------------|
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Aug 13, 2025, 3:32 PM |

```go
probe.SetStage("acquire_buffer_capacity")
semaphoreCtx, cancel := context.WithTimeout(ctx, s.node.Config.StoreChunksBufferTimeout)
defer cancel()
err = s.node.StoreChunksSemaphore.Acquire(semaphoreCtx, int64(downloadSizeInBytes))
```
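
For reference, a minimal sketch of the corrected pattern this PR describes: every successful Acquire is paired with a deferred Release, so buffer capacity is returned even on error paths. This assumes StoreChunksSemaphore is a golang.org/x/sync/semaphore.Weighted; the function and helper names here are illustrative, not the actual EigenDA code.

```go
package node

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// storeChunksSketch illustrates the acquire/release pairing. The semaphore's
// capacity bounds how many chunk bytes may be buffered in memory at once.
func storeChunksSketch(
	ctx context.Context,
	sem *semaphore.Weighted,
	downloadSizeInBytes uint64,
	bufferTimeout time.Duration,
) error {
	// Bound how long we wait for buffer capacity to free up.
	semaphoreCtx, cancel := context.WithTimeout(ctx, bufferTimeout)
	defer cancel()

	if err := sem.Acquire(semaphoreCtx, int64(downloadSizeInBytes)); err != nil {
		// Capacity did not free up in time: fail fast and shed the request.
		return fmt.Errorf("failed to acquire buffer capacity: %w", err)
	}
	// The fix: the acquired capacity is always released, even if the
	// processing below returns an error.
	defer sem.Release(int64(downloadSizeInBytes))

	return processChunks(ctx) // placeholder for the actual work
}

func processChunks(ctx context.Context) error { return nil } // stub
```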
Contributor

This blocks until the request's size in bytes is free or semaphoreCtx is done, right? This essentially means the StoreChunks request from the disperser fails if previous requests are still being processed? Does this mean the disperser needs to retry on this error?

Contributor Author

> This essentially means the StoreChunks request from the disperser fails if previous requests are still being processed

Correct. The intention of this change is to prevent too many chunks from being in memory at a single instant in time.

> Does this mean the disperser needs to retry on this error?

I've currently got dispersal retries disabled, as the current implementation is extremely inefficient. So if this triggers, the validator will not end up signing for some batches. I think this is ok though... if the disperser is sending more work than the validator can handle, it NEEDS to skip some batches or else it will accumulate a backlog that will eventually lead to OOM.
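
To make the failure mode concrete, here is a tiny standalone demo (not EigenDA code) of the semantics discussed above: when the weighted semaphore is exhausted, Acquire blocks until capacity frees or the timeout context expires, and the request then fails rather than queueing indefinitely.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

func main() {
	// Pretend the node allows 100 bytes of chunk data in memory at once.
	sem := semaphore.NewWeighted(100)

	// A first request takes all the capacity and holds it.
	_ = sem.Acquire(context.Background(), 100)

	// A second request cannot fit; it fails once the timeout elapses,
	// shedding load instead of building an unbounded backlog.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	if err := sem.Acquire(ctx, 100); err != nil {
		fmt.Println("request shed:", err) // prints: context deadline exceeded
	}
}
```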

Contributor

So essentially, failed dispersals are what we are agreeing to.

Contributor Author

Possibly worth having a discussion on this.

I'm not convinced it's possible to have a robust, high-performance system without the ability to shed load under high stress. It's the difference between a system that recovers from a spike too large to handle and a system that face-plants when traffic spikes above a critical threshold.

litt3
litt3 previously approved these changes Aug 12, 2025
anupsv
anupsv previously approved these changes Aug 12, 2025
litt3
litt3 previously approved these changes Aug 12, 2025
@cody-littley cody-littley dismissed stale reviews from litt3 and anupsv via 2e36a68 August 13, 2025 14:05
litt3
litt3 previously approved these changes Aug 13, 2025
node/config.go Outdated
```diff
@@ -424,7 +412,7 @@ func NewConfig(ctx *cli.Context) (*Config, error) {
 	LittDBReadCacheSizeBytes:    uint64(ctx.GlobalFloat64(flags.LittDBReadCacheSizeGBFlag.Name) * units.GiB),
 	LittDBReadCacheSizeFraction: ctx.GlobalFloat64(flags.LittDBReadCacheSizeFractionFlag.Name),
 	LittDBStoragePaths:          ctx.GlobalStringSlice(flags.LittDBStoragePathsFlag.Name),
-	LittUnsafePurgeLocks:        ctx.GlobalBool(flags.LittUnsafePurgeLocksFlag.Name),
+	LittRespectLocks:            ctx.GlobalBool(flags.LitRespectLocksFlag.Name),
```
Contributor

Suggested change:

```diff
-	LittRespectLocks:            ctx.GlobalBool(flags.LitRespectLocksFlag.Name),
+	LittRespectLocks:            ctx.GlobalBool(flags.LittRespectLocksFlag.Name),
```

Contributor Author

fixed

@cody-littley cody-littley enabled auto-merge August 13, 2025 15:57
@cody-littley cody-littley added this pull request to the merge queue Aug 13, 2025
Merged via the queue into master with commit 11659e8 Aug 13, 2025
24 of 25 checks passed
@cody-littley cody-littley deleted the fix-storeChunks-semaphore branch August 13, 2025 16:10