@dmanc dmanc commented Aug 6, 2025

Why are these changes needed?

This PR adds a defensive check so that trying to generate a table that needs more SRS points than are loaded returns an error instead of panicking.

Original issue:

2025/08/05 07:39:49 Table with params: DimE=4 CosetSize=2097152 does not exist

leads to:

panic: runtime error: index out of range [6291450] with length 2097152

github.com/Layr-Labs/eigenda/encoding/kzg/prover.(*SRSTable).PrecomputeSubTable(...)
    /app/encoding/kzg/prover/precompute.go:188 +0x2ca
    - Parameters: (0xc013d73a00, 0xc0134f64e8?, 0xf971255373de12?, 0x3, 0x90000000ffffffff?, 0x5, 0x200000)

github.com/Layr-Labs/eigenda/encoding/kzg/prover.(*SRSTable).precomputeWorker(...)
    /app/encoding/kzg/prover/precompute.go:151 +0x10b
    - Parameters: (0xc00590e300?, 0xc041baefd0?, 0x44843c?, 0x14054a8?, 0xa?, 0xc00590e300?, 0xc000504000?, 0xc01ddcce90?)

created by github.com/Layr-Labs/eigenda/encoding/kzg/prover.(*SRSTable).Precompute
    /app/encoding/kzg/prover/precompute.go:173 +0xa5
    - Origin goroutine: 471265176

This is because we set the following configuration:

name: DISPERSER_ENCODER_SRS_LOAD
value: "2097152"

The root issue is that if an SRS table is not precomputed, we try to compute it and store it. Computing that table requires more SRS points than the 2097152 we load, so we index past the end of the loaded points and panic.

Options are:

  1. Have more points loaded.
  2. Add defensive check (this PR)
  3. Remove support for computing SRS tables on the fly.

This issue is not present on the V2 encoder, since numChunks is fixed to 8192, which limits the number of SRS tables we need precomputed. V1 encoding parameters are determined by the operator state, so we've started to hit this issue due to instability in the Holesky operator state.

Checks

  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

github-actions bot commented Aug 6, 2025

The latest Buf updates on your PR. Results from workflow Buf Proto / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Aug 6, 2025, 8:13 PM |

cody-littley previously approved these changes Aug 6, 2025
bxue-l2 previously approved these changes Aug 6, 2025

@bxue-l2 bxue-l2 left a comment

lgtm

@dmanc dmanc changed the title from "Check bounds before precomputing SRS table" to "fix: Check bounds before precomputing SRS table" Aug 6, 2025
@dmanc dmanc dismissed stale reviews from bxue-l2 and cody-littley via c65f4aa August 6, 2025 20:13

anupsv commented Aug 6, 2025

I wonder if this is cacheable and we could load it or something?

dmanc commented Aug 6, 2025

> I wonder if this is cacheable and we could load it or something?

We do cache a lot of the tables already. The problem is that when the operator state changes in V1, we end up with some weird parameters like (DimE=4, CosetSize=2097152), which cause the encoders to panic and crash loop.

The mitigation for the crash loop so far has been to generate that table on a separate machine and add it to the SRSTables folder for each encoder. This change would just allow us to emit an error message rather than crash loop the encoders.
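For context on why a panic here takes down the whole encoder: the stack trace shows the failing access running inside a worker goroutine, and an unrecovered panic in any goroutine terminates the process. The sketch below shows one way workers could report a failed bounds check back to the caller instead. All names (`job`, `worker`, `runAll`) are hypothetical and the work is reduced to reading one element, so this illustrates the pattern rather than the actual precompute code.

```go
package main

import (
	"fmt"
	"sync"
)

// job is a hypothetical unit of work: read one point at index idx.
type job struct{ idx int }

// worker validates each index before use and reports failures on errs,
// instead of letting an out-of-range access panic the whole process.
func worker(points []int, jobs <-chan job, errs chan<- error, wg *sync.WaitGroup) {
	defer wg.Done()
	for j := range jobs {
		if j.idx < 0 || j.idx >= len(points) {
			errs <- fmt.Errorf("index %d out of range for %d loaded points", j.idx, len(points))
			continue
		}
		_ = points[j.idx] // stand-in for the real computation
	}
}

// runAll fans the indices out to numWorkers goroutines and returns one of
// the reported errors, or nil if every job succeeded.
func runAll(points []int, idxs []int, numWorkers int) error {
	jobs := make(chan job, len(idxs))
	errs := make(chan error, len(idxs)) // buffered so workers never block
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go worker(points, jobs, errs, &wg)
	}
	for _, i := range idxs {
		jobs <- job{idx: i}
	}
	close(jobs)
	wg.Wait()
	close(errs)
	return <-errs // receiving from a closed, empty channel yields nil
}

func main() {
	points := make([]int, 8)
	fmt.Println(runAll(points, []int{1, 2, 3}, 2))
	fmt.Println(runAll(points, []int{1, 20}, 2))
}
```

The caller can then log the error and fail the single request, which matches the goal of emitting an error message rather than crash looping.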

@dmanc dmanc added this pull request to the merge queue Aug 7, 2025
Merged via the queue into master with commit 9c00da8 Aug 7, 2025
24 checks passed
@dmanc dmanc deleted the encoder-dont-panic branch August 7, 2025 17:31

samlaf commented Aug 11, 2025

> > I wonder if this is cacheable and we could load it or something?
>
> We do cache a lot of the tables already. The problem is that when the operator state changes in V1, we end up with some weird parameters like (DimE=4, CosetSize=2097152), which cause the encoders to panic and crash loop.
>
> The mitigation for the crash loop so far has been to generate that table on a separate machine and add it to the SRSTables folder for each encoder. This change would just allow us to emit an error message rather than crash loop the encoders.

This is just V1, so it doesn't really matter I guess, but going forward I've been trying to instill that we start following https://joeduffyblog.com/2016/02/07/the-error-model/#bugs-arent-recoverable-errors instead of returning errors. Are we sure that this error is being sent back to the user as a 500, for example? A panic and recover (using https://pkg.go.dev/github.com/grpc-ecosystem/go-grpc-middleware/recovery) would give us that for free.
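To make the panic-and-recover idea concrete, here is a stdlib-only sketch of the underlying pattern. It does not use the linked recovery middleware; `safeCall` is a hypothetical helper that plays the role such an interceptor would play around a request handler, converting a panic into an error the server can return.

```go
package main

import "fmt"

// safeCall (hypothetical) runs fn and converts any panic raised inside it
// into an ordinary error, so one bad request cannot crash the process.
func safeCall(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("internal error: recovered from panic: %v", r)
		}
	}()
	return fn()
}

func main() {
	points := make([]int, 4)
	err := safeCall(func() error {
		idx := 10
		_ = points[idx] // out-of-range access panics at runtime
		return nil
	})
	fmt.Println(err)
}
```

Note that `recover` only catches panics on the goroutine where the deferred function runs, so a handler-level recover would not catch a panic raised inside a separately spawned worker goroutine such as the one in the stack trace above.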
