-
Notifications
You must be signed in to change notification settings - Fork 1.9k
CCIP-7920 Defensive CCIP Oracle creation #20265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
I see you updated files related to
|
|
| oracle, err := oracleCreator.Create(ctx, don.ID, cctypes.OCR3ConfigWithMeta(config)) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to create CCIP oracle: %w for digest %x", err, digest) | ||
| aggErr = errors.Join(aggErr, fmt.Errorf("failed to create CCIP oracle: %w for digest %x", err, digest)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| aggErr = errors.Join(aggErr, fmt.Errorf("failed to create CCIP oracle: %w for digest %x", err, digest)) | |
| aggErr = errors.Join(aggErr, fmt.Errorf("failed to create CCIP oracle: %w for digest %x", err, digest[:])) |
Since digest is a [32]byte I've seen this show up in logs as a super large integer rather than the expected hex value - I think adding the slice notation might fix it? Worth checking in a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
digest is of ConfigDigest type which has String() function defined https://github.com/smartcontractkit/libocr/blob/7324c200a6b0fff57449fc90025eb8b99df5e735/offchainreporting2plus/types/config_digest.go#L98
Therefore, it's always printed nicely. Checked in the test manually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe switch the format verb to %s then? Cuz this is what I see in our production logs as the error:
{
"level": "error",
"ts": "2025-10-29T15:59:06.131Z",
"caller": "launcher/launcher.go:147",
"msg": "Failed to tick",
"version": "2.29.0@00848b1",
"err": "failed to process diff: processAdded: call createDON 23: failed to create CCIP oracle: failed to create CCIPProviders: failed to get chain selector from relay ID 26888 and family evm: invalid chain id 26888 for evm for digest 30303061656263393431663534626235373935643936323531613035313335333163613238323366663565343830323535613736313838313363363631633763",
"stacktrace": "github.com/smartcontractkit/chainlink/v2/core/capabilities/ccip/launcher.(*launcher).monitor\n\t/chainlink/core/capabilities/ccip/launcher/launcher.go:147"
}Notice the part after the for digest
| if err != nil { | ||
| return fmt.Errorf("processAdded: call createDON %d: %w", donID, err) | ||
| l.lggr.Errorw("Some oracles failed to be created", "donID", donID, "err", err) | ||
| } | ||
| if len(newPlugins) == 0 { | ||
| // not a member of this DON. | ||
| // not a member of this DON or no oracles could be created | ||
| continue | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should do an explicit continue in the err != nil case instead:
if err != nil {
l.lggr.Errorw("Some oracles failed to be created", "donID", donID, "err", err)
continue
}
if len(newPlugins) == 0 {
// not a member of this DON
continue
}My reasoning is that newPlugins is an invalid return value is err != nil and shouldn't be checked. I guess we can even add a more explicit log in the len(newPlugins) == 0 case (and no error) but it seems to be logged already in the createDON and updateDON functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning was the following: createDON can return 3 different value sets
// 1. all plugins created without problems, error is nil and newPlugins is populated
newPlugins, err := map[], nil
// 2. nothing was created
newPlugins, err := nil, "some aggregated errors"
// 3. partial state in which some oracles were created but some failed to be created
newPlugins, err := map[], "some oracles failed"So if I understand correctly, we can't continue right after if err != nil, because newPlugins might contain some initialized oracles that need to be started
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is against Go rules to return valid results together with an error then I can only log them in createDON and that will simplify the handling logic here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah returning valid results with an error seems more complicated to me. continue-ing in this loop is fine (we'll end up re-trying this DON) but if we update the state of the launcher w/ a half-started DON then we will never retry, if I'm understanding this code correctly.
| // update state. | ||
| l.instances[donID] = newPlugins | ||
| l.regState.IDsToDONs[donID] = added[donID] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we'll continue to retry creating the DON that failed to be created, since this state would not update for the failed DON, and the diff of the latest registry state and the stored state will produce something. I think this makes sense as behavior of this launcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we have a retry mechanism already baked in. We will keep retrying as long as the launcher keeps ticking (which is till the process is up)
| if err != nil { | ||
| return err | ||
| l.lggr.Errorw("Some oracles failed to be created", "donID", donID, "err", err) | ||
| } | ||
| if len(newPlugins) == 0 { | ||
| // not a member of this DON. | ||
| // not a member of this DON or no oracles could be created | ||
| continue | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here re: where to put the continue, but open to hearing thoughts on it




More defensive approach to creating CCIP Oracles. So far, a single oracle creation failure has bubbled up and stopped the entire process. Therefore single faulty definition of a job or an invalid configuration impacted the entire node by bringing down all the oracles.
Right now, the approach is more defensive / best-effort. We try to create as many oracles as possible without interrupting the process on error. Launcher ticks in intervals, which will take care of the retries. If the config is faulty, it would keep erroring on the log level without impacting other oracles.