fix: semaphore use for StoreChunks #1866
Conversation
probe.SetStage("acquire_buffer_capacity")
semaphoreCtx, cancel := context.WithTimeout(ctx, s.node.Config.StoreChunksBufferTimeout)
defer cancel()
err = s.node.StoreChunksSemaphore.Acquire(semaphoreCtx, int64(downloadSizeInBytes))
This blocks until capacity for the request's size in bytes is free, or the semaphoreCtx is done.
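For readers less familiar with the weighted-semaphore API, here is a minimal sketch of that acquisition pattern, assuming golang.org/x/sync/semaphore; the capacity, timeout, and function names are illustrative, not the PR's actual identifiers:

// Sketch only: illustrates the blocking acquire-with-timeout pattern quoted above.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// Total bytes of chunk data allowed in memory at once (illustrative value).
var storeChunksSem = semaphore.NewWeighted(1 << 30)

func storeChunks(ctx context.Context, downloadSizeInBytes uint64) error {
	acquireCtx, cancel := context.WithTimeout(ctx, 10*time.Second) // illustrative timeout
	defer cancel()

	// Blocks until downloadSizeInBytes of capacity is free, or fails with
	// acquireCtx.Err() if the timeout (or the parent ctx) expires first.
	if err := storeChunksSem.Acquire(acquireCtx, int64(downloadSizeInBytes)); err != nil {
		return fmt.Errorf("could not reserve buffer capacity: %w", err)
	}
	// Give the capacity back once the chunks are no longer buffered in memory.
	defer storeChunksSem.Release(int64(downloadSizeInBytes))

	// ... download, validate, and persist the chunks ...
	return nil
}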
Right. This essentially means the StoreChunks request from the disperser fails in case the previous requests are still being processed? Does this mean the disperser needs to retry on this error?
This essentially means the StoreChunks request from the disperser fails in case the previous requests are still being processed
Correct. The intention of this change is to prevent too many chunks from being in memory at a single instant in time.
Does this mean the disperser needs to retry on this error?
I've currently got dispersal retries disabled, as the current implementation is extremely inefficient. So if this triggers, the validator will not end up signing for some batches. I think this is ok though... if the disperser is sending more work than the validator can handle, it NEEDS to skip some batches or else it will accumulate a backlog that will eventually lead to OOM.
So essentially, failed dispersals are what we are agreeing to.
Possibly worth having a discussion on this.
I'm not convinced it's possible to have a robust, high-performance system without the ability to shed load when under high stress. It's the difference between a system that recovers from a spike too large to handle, and a system that face-plants when traffic spikes above a critical threshold.
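To make the contrast concrete, here is a sketch of bounded admission under the same assumptions as the snippet above (storeChunksSem and admitRequest are hypothetical names): excess work is rejected explicitly instead of queueing indefinitely and holding its bytes in memory.

// Sketch: shed load by failing fast when the buffer capacity is exhausted.
func admitRequest(requestBytes int64) (release func(), err error) {
	if !storeChunksSem.TryAcquire(requestBytes) {
		// Reject immediately; the disperser sees an explicit failure and can
		// skip or retry, while the validator's memory footprint stays bounded.
		return nil, fmt.Errorf("node at capacity, shedding load")
	}
	// Caller invokes release() once the request's bytes leave memory.
	return func() { storeChunksSem.Release(requestBytes) }, nil
}

TryAcquire rejects with no waiting at all; the PR's approach of Acquire with a bounded timeout has the same shedding effect after a short wait, which smooths over brief spikes.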
node/config.go
Outdated
@@ -424,7 +412,7 @@ func NewConfig(ctx *cli.Context) (*Config, error) {
LittDBReadCacheSizeBytes: uint64(ctx.GlobalFloat64(flags.LittDBReadCacheSizeGBFlag.Name) * units.GiB),
LittDBReadCacheSizeFraction: ctx.GlobalFloat64(flags.LittDBReadCacheSizeFractionFlag.Name),
LittDBStoragePaths: ctx.GlobalStringSlice(flags.LittDBStoragePathsFlag.Name),
LittUnsafePurgeLocks: ctx.GlobalBool(flags.LittUnsafePurgeLocksFlag.Name),
LittRespectLocks: ctx.GlobalBool(flags.LitRespectLocksFlag.Name),
LittRespectLocks: ctx.GlobalBool(flags.LitRespectLocksFlag.Name),
LittRespectLocks: ctx.GlobalBool(flags.LittRespectLocksFlag.Name),
fixed
Why are these changes needed?
Fixes the way StoreChunks() uses a semaphore; the old pattern never released the semaphore.
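As a rough sketch of the bug class being fixed (illustrative, not the PR's actual code; processChunks is a hypothetical stand-in for the real work): if the acquired weight is never given back, the semaphore's capacity drains until every later StoreChunks call blocks and then times out. Pairing each successful Acquire with a deferred Release keeps the accounting correct on every return path.

// Broken pattern (sketch): weight is acquired but never released, so capacity
// leaks and eventually all requests time out waiting on the semaphore.
func storeChunksLeaky(ctx context.Context, sem *semaphore.Weighted, n int64) error {
	if err := sem.Acquire(ctx, n); err != nil {
		return err
	}
	return processChunks() // no sem.Release(n) on any path
}

// Fixed pattern (sketch): every successful Acquire is matched by a Release,
// regardless of whether processing succeeds or fails.
func storeChunksBounded(ctx context.Context, sem *semaphore.Weighted, n int64) error {
	if err := sem.Acquire(ctx, n); err != nil {
		return err
	}
	defer sem.Release(n)
	return processChunks()
}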