Skip to content

Conversation

@mateusz-sekara
Copy link
Collaborator

@mateusz-sekara mateusz-sekara commented Nov 6, 2025

More defensive approach to creating CCIP Oracles. So far, a single oracle creation failure has bubbled up and stopped the entire process. Therefore single faulty definition of a job or an invalid configuration impacted the entire node by bringing down all the oracles.

Right now, the approach is more defensive / best-effort. We try to create as many oracles as possible without interrupting the process on error. Launcher ticks in intervals, which will take care of the retries. If the config is faulty, it would keep erroring on the log level without impacting other oracles.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 6, 2025

I see you updated files related to core. Please run pnpm changeset in the root directory to add a changeset as well as in the text include at least one of the following tags:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@trunk-io
Copy link

trunk-io bot commented Nov 6, 2025

Static BadgeStatic BadgeStatic BadgeStatic Badge

View Full Report ↗︎Docs

@cl-sonarqube-production
Copy link

oracle, err := oracleCreator.Create(ctx, don.ID, cctypes.OCR3ConfigWithMeta(config))
if err != nil {
return nil, fmt.Errorf("failed to create CCIP oracle: %w for digest %x", err, digest)
aggErr = errors.Join(aggErr, fmt.Errorf("failed to create CCIP oracle: %w for digest %x", err, digest))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
aggErr = errors.Join(aggErr, fmt.Errorf("failed to create CCIP oracle: %w for digest %x", err, digest))
aggErr = errors.Join(aggErr, fmt.Errorf("failed to create CCIP oracle: %w for digest %x", err, digest[:]))

Since digest is a [32]byte I've seen this show up in logs as a super large integer rather than the expected hex value - I think adding the slice notation might fix it? Worth checking in a test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

digest is of ConfigDigest type which has String() function defined https://github.com/smartcontractkit/libocr/blob/7324c200a6b0fff57449fc90025eb8b99df5e735/offchainreporting2plus/types/config_digest.go#L98

Therefore, it's always printed nicely. Checked in the test manually

Copy link
Contributor

@makramkd makramkd Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe switch the format verb to %s then? Cuz this is what I see in our production logs as the error:

{
  "level": "error",
  "ts": "2025-10-29T15:59:06.131Z",
  "caller": "launcher/launcher.go:147",
  "msg": "Failed to tick",
  "version": "2.29.0@00848b1",
  "err": "failed to process diff: processAdded: call createDON 23: failed to create CCIP oracle: failed to create CCIPProviders: failed to get chain selector from relay ID 26888 and family evm: invalid chain id 26888 for evm for digest 30303061656263393431663534626235373935643936323531613035313335333163613238323366663565343830323535613736313838313363363631633763",
  "stacktrace": "github.com/smartcontractkit/chainlink/v2/core/capabilities/ccip/launcher.(*launcher).monitor\n\t/chainlink/core/capabilities/ccip/launcher/launcher.go:147"
}

Notice the part after the for digest

Comment on lines 257 to 263
if err != nil {
return fmt.Errorf("processAdded: call createDON %d: %w", donID, err)
l.lggr.Errorw("Some oracles failed to be created", "donID", donID, "err", err)
}
if len(newPlugins) == 0 {
// not a member of this DON.
// not a member of this DON or no oracles could be created
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should do an explicit continue in the err != nil case instead:

if err != nil {
    l.lggr.Errorw("Some oracles failed to be created", "donID", donID, "err", err)
    continue
}
if len(newPlugins) == 0 {
    // not a member of this DON
    continue
}

My reasoning is that newPlugins is an invalid return value is err != nil and shouldn't be checked. I guess we can even add a more explicit log in the len(newPlugins) == 0 case (and no error) but it seems to be logged already in the createDON and updateDON functions

Copy link
Collaborator Author

@mateusz-sekara mateusz-sekara Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning was the following: createDON can return 3 different value sets

// 1. all plugins created without problems, error is nil and newPlugins is populated
newPlugins, err := map[], nil

// 2. nothing was created 
newPlugins, err := nil, "some aggregated errors"

// 3. partial state in which some oracles were created but some failed to be created
newPlugins, err := map[], "some oracles failed"

So if I understand correctly, we can't continue right after if err != nil, because newPlugins might contain some initialized oracles that need to be started

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is against Go rules to return valid results together with an error then I can only log them in createDON and that will simplify the handling logic here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah returning valid results with an error seems more complicated to me. continue-ing in this loop is fine (we'll end up re-trying this DON) but if we update the state of the launcher w/ a half-started DON then we will never retry, if I'm understanding this code correctly.

Comment on lines 274 to 276
// update state.
l.instances[donID] = newPlugins
l.regState.IDsToDONs[donID] = added[donID]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we'll continue to retry creating the DON that failed to be created, since this state would not update for the failed DON, and the diff of the latest registry state and the stored state will produce something. I think this makes sense as behavior of this launcher.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we have a retry mechanism already baked in. We will keep retrying as long as the launcher keeps ticking (which is till the process is up)

Comment on lines 218 to 224
if err != nil {
return err
l.lggr.Errorw("Some oracles failed to be created", "donID", donID, "err", err)
}
if len(newPlugins) == 0 {
// not a member of this DON.
// not a member of this DON or no oracles could be created
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here re: where to put the continue, but open to hearing thoughts on it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants