Conversation
@jmcguire98 jmcguire98 commented Oct 29, 2025

Description

We want to more easily surface to end users the cases where configuration they provided was translated into config that was rejected (NACKed) by the gateway it was sent to. In the past this most frequently happened when a user created a policy with illegal CEL, which was NACKed at the gateway (we have since added CEL validation before translation).

Achieving status reporting here is complicated by the fact that kgateway can be deployed with multiple replicas and leader election enabled. Currently only the leader is allowed to write status, which works because the leader translates every resource passed to it. However, we can't follow the exact same pattern for NACK reporting, because only the replica holding an active xDS connection to a specific gateway can detect NACKs from that gateway (the leader may not have connections to all gateways).

So we needed a mechanism to surface NACKs across replicas so that the leader can adequately report status across all gateways. To achieve this, I added a NACK publisher that publishes a k8s event when a NACK occurs, and publishes a subsequent ACK event when the issue causing that NACK is resolved.
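The shape of that publisher can be sketched roughly as follows. This is a minimal, self-contained Go sketch, not the actual kgateway code: the `Recorder` interface stands in for a k8s EventRecorder, and the `NackEvent`/`AckEvent` fields and the dedup map are illustrative assumptions.

```go
package main

import "fmt"

// Recorder abstracts event emission; in kgateway this role would be
// played by a k8s EventRecorder (assumption for this sketch).
type Recorder interface {
	Event(object, reason, message string)
}

type NackEvent struct {
	Gateway string // "namespace/name"
	TypeUrl string
	Detail  string
}

type AckEvent struct {
	Gateway string
	TypeUrl string
}

// Publisher tracks outstanding NACKs so that a later ACK for the same
// (gateway, typeURL) pair can be reported as a recovery.
type Publisher struct {
	rec     Recorder
	pending map[string]bool // keyed by nack ID
}

func nackID(gateway, typeUrl string) string {
	return gateway + "|" + typeUrl
}

func (p *Publisher) OnNack(e NackEvent) {
	id := nackID(e.Gateway, e.TypeUrl)
	if p.pending[id] {
		return // already reported; avoid duplicate events
	}
	p.pending[id] = true
	p.rec.Event(e.Gateway, "ResourceRejected", e.Detail)
}

func (p *Publisher) OnAck(e AckEvent) {
	id := nackID(e.Gateway, e.TypeUrl)
	if !p.pending[id] {
		return // no outstanding NACK for this pair; nothing to recover
	}
	delete(p.pending, id)
	p.rec.Event(e.Gateway, "ResourceAccepted", "previous NACK resolved")
}

// logRecorder is a stand-in recorder for demonstration.
type logRecorder struct{ events []string }

func (l *logRecorder) Event(object, reason, message string) {
	l.events = append(l.events, object+" "+reason)
}

func main() {
	rec := &logRecorder{}
	p := &Publisher{rec: rec, pending: map[string]bool{}}
	p.OnNack(NackEvent{Gateway: "default/gw", TypeUrl: "type/A", Detail: "bad CEL"})
	p.OnNack(NackEvent{Gateway: "default/gw", TypeUrl: "type/A", Detail: "bad CEL"}) // deduped
	p.OnAck(AckEvent{Gateway: "default/gw", TypeUrl: "type/A"})
	fmt.Println(rec.events) // [default/gw ResourceRejected default/gw ResourceAccepted]
}
```

The dedup map mirrors the idea discussed later in this thread: one event per (gateway, typeURL) pair, with the ACK keyed so it resolves exactly the NACK it corresponds to.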

This enables the leader to report NACK statuses via a krt collection of events (filtered to events that involve gateways, to limit the performance cost).
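The leader-side filtering step can be illustrated like this. This is a hypothetical pure-Go sketch; the real code uses a krt collection over k8s Event objects, which is not reproduced here, and the `Event` struct below is a simplified stand-in.

```go
package main

import "fmt"

// Event mirrors the few fields of a k8s Event that matter here (simplified).
type Event struct {
	InvolvedKind string // kind of the involved object, e.g. "Gateway"
	InvolvedName string
	Reason       string
}

// gatewayEvents keeps only events whose involved object is a Gateway,
// so the leader processes far fewer events than exist cluster-wide.
func gatewayEvents(all []Event) []Event {
	var out []Event
	for _, e := range all {
		if e.InvolvedKind == "Gateway" {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	all := []Event{
		{InvolvedKind: "Pod", InvolvedName: "p1", Reason: "Scheduled"},
		{InvolvedKind: "Gateway", InvolvedName: "gw", Reason: "ResourceRejected"},
	}
	fmt.Println(len(gatewayEvents(all))) // 1
}
```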

Change Type

/kind new_feature

Changelog

Added status reporting on agentgateway Gateways that indicates when a gateway has NACKed an update

Additional Notes

fixes #12671

Signed-off-by: Joe McGuire <[email protected]>

@github-actions github-actions bot added kind/feature Categorizes issue or PR as related to a new feature. release-note labels Oct 29, 2025
@npolshakova npolshakova self-requested a review October 29, 2025 19:01

// onAck publishes an ACK event as a k8s event.
func (p *Publisher) onAck(event AckEvent) {
recoveredNackID := ComputeNackID(event.Gateway.Namespace+"/"+event.Gateway.Name, event.TypeUrl)
Contributor
I think this is getting computed twice? Is it different from AnnotationNackID?

Contributor Author
I think we have to compute it again here to know which NACK this would resolve, since theoretically we could have more than one NACK at a time on the same gateway with different TypeURLs

}
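One plausible shape for `ComputeNackID`, consistent with the discussion above, is a deterministic hash over the gateway key and the xDS type URL, so a later ACK for the same pair maps back to the same ID. This implementation is an assumption for illustration; the actual kgateway function may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ComputeNackID derives a stable ID from the gateway key ("ns/name")
// and the xDS type URL, so NACKs for different resource types on the
// same gateway stay distinct (hypothetical implementation).
func ComputeNackID(gatewayKey, typeUrl string) string {
	sum := sha256.Sum256([]byte(gatewayKey + "|" + typeUrl))
	return hex.EncodeToString(sum[:8])
}

func main() {
	a := ComputeNackID("default/gw", "type.googleapis.com/envoy.config.listener.v3.Listener")
	b := ComputeNackID("default/gw", "type.googleapis.com/envoy.config.route.v3.RouteConfiguration")
	fmt.Println(a != b) // true: same gateway, different type URLs, different IDs
}
```

Determinism is the key property: the same (gateway, typeURL) inputs must always produce the same ID, or the ACK path could never find the NACK it resolves.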

// onNack publishes a NACK event as a k8s event.
func (p *Publisher) onNack(event NackEvent) {
Contributor
How does this work for multiple gateways (like multiple replicas of one k8s Gateway resource?)

Contributor Author

We would report multiple events (one for each replica that encountered a NACK), but they all have the same involved object, so the gateway would still get the Programmed: False condition added only once.

Contributor Author

Actually, now that we've switched to the event recorder, we should only create one event per gateway here

@jmcguire98 jmcguire98 force-pushed the report-agw-nacks-on-gateways branch from d1de508 to d93746e Compare October 29, 2025 21:04
@jmcguire98 jmcguire98 marked this pull request as ready for review October 31, 2025 16:15

Labels

kind/feature Categorizes issue or PR as related to a new feature. release-note

Development

Successfully merging this pull request may close these issues.

agentgateway: report statuses on gateways if the dataplane NACKs an update