agentgateway: report nacks via gateway status #12770
Conversation
// onAck publishes an ACK event as a k8s event.
func (p *Publisher) onAck(event AckEvent) {
	recoveredNackID := ComputeNackID(event.Gateway.Namespace+"/"+event.Gateway.Name, event.TypeUrl)
I think this is getting computed twice? Is it different from AnnotationNackID?
I think we have to compute it again here to know which nack this would resolve, since theoretically we could have more than one nack at a time on the same gateway with different TypeUrls.
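For context, here is a minimal sketch of a nack ID keyed on both the gateway and the xDS TypeUrl, which is what allows two nacks on the same gateway to coexist and resolve independently. This is an assumed illustration, not the PR's actual ComputeNackID, which may encode or hash its inputs differently.

```go
package nack

import "fmt"

// ComputeNackID (sketch): keyed on the gateway *and* the TypeUrl, so an ACK
// for one TypeUrl only clears the nack previously recorded for that same
// TypeUrl on that gateway. The real implementation may differ.
func ComputeNackID(gatewayNamespacedName, typeUrl string) string {
	return fmt.Sprintf("%s|%s", gatewayNamespacedName, typeUrl)
}
```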
}

// onNack publishes a NACK event as a k8s event.
func (p *Publisher) onNack(event NackEvent) {
How does this work for multiple gateways (e.g. multiple replicas of one k8s Gateway resource)?
We would report multiple events (one for each replica that encountered a nack), but they all ultimately have the same involved object, so the Programmed: false condition would still only be added to the gateway once.
Actually, now that we've switched to the event recorder, we should only create one event per gateway here.
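For illustration, here is a minimal sketch (with simplified types and an assumed recorder field, not the PR's actual code) of publishing a NACK through a client-go EventRecorder with the Gateway as the involved object, which is why every replica's report attaches to the same object:

```go
package nack

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// NackEvent is a simplified stand-in for the PR's event type.
type NackEvent struct {
	Gateway runtime.Object // the Gateway resource the nack applies to
	TypeUrl string
}

// Publisher is a simplified stand-in; the PR's Publisher holds more state.
type Publisher struct {
	recorder record.EventRecorder // assumed field name
}

// onNack sketches recording a NACK as a k8s Warning event on the Gateway.
// Because all replicas use the same involved object, the leader only needs
// to flip the gateway's Programmed condition to false once.
func (p *Publisher) onNack(event NackEvent) {
	p.recorder.Eventf(
		event.Gateway,
		corev1.EventTypeWarning,
		"XDSNack", // hypothetical reason string
		"configuration rejected (NACKed) for %s", event.TypeUrl,
	)
}
```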
Description
We want to make it easier to surface to end users the cases where configuration they provided was translated into config that the target gateway rejected (NACKed). In the past this most frequently happened when a user created a policy with illegal CEL, which was nacked at the gateway (we have since added CEL validation before translation).
Status reporting here is complicated by the fact that kgateway can be deployed with multiple replicas and leader election enabled. Currently only the leader is allowed to write status, and this works because the leader translates every resource passed to it. We can't follow the exact same pattern for NACK reporting, because only the replica with an active xDS connection to a specific gateway can detect NACKs from that gateway (the leader may not have connections to all gateways).
So we needed a mechanism to surface NACKs across replicas so that the leader can adequately report status for all gateways. To achieve this, I added a NACK publisher that publishes a k8s event when a NACK occurs and a subsequent ACK event when the issue causing that NACK is resolved.
This enables the leader to report NACK statuses via a krt collection of events, filtered to events specifically involving gateways to limit the performance cost.
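As an illustration of the filtering described above, here is a minimal sketch of restricting the leader's event collection to gateway-related NACK/ACK events. The reason strings, API version, and exact predicate are assumptions for the example, not necessarily what the PR uses.

```go
package nack

import (
	corev1 "k8s.io/api/core/v1"
)

// isGatewayNackEvent sketches the filter: the leader only considers events
// whose involved object is a Gateway API Gateway and whose reason marks it as
// a NACK or ACK report, ignoring the much larger stream of unrelated cluster
// events. Reason strings and the APIVersion check are hypothetical.
func isGatewayNackEvent(ev *corev1.Event) bool {
	return ev.InvolvedObject.Kind == "Gateway" &&
		ev.InvolvedObject.APIVersion == "gateway.networking.k8s.io/v1" &&
		(ev.Reason == "XDSNack" || ev.Reason == "XDSAck")
}
```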
Change Type
/kind new_feature
Changelog
Additional Notes
fixes #12671