Skip to content

Conversation

tcp13equals2
Copy link
Contributor

@tcp13equals2 tcp13equals2 commented Jul 31, 2025

This is a PR for the implementation of a custom pod disruption budget which is zone and partition aware.

#194

Included in this PR;

  • new validating web hook for handling pod eviction requests
  • new custom resource definition to define the new zone aware pod disruption (zpdb) kind (configuration)
  • validating webhook configuration to ensure only valid zpdb configurations are applied
  • updated development sample files and examples
  • updated integration tests
  • includes support for both zone and partition aware zpdb configurations

Note - this has been tested in mimir-dev-11 in the classic zones configuration draining multiple nodes which span multiple zones + pods.

Note - from a deployment perspective, when this is rolled out to real clusters it is likely it will have to rolled out in 2 phases. Phase 1 would have the updated rollout operator, new CRD, the validating webhook configurations but not the actual new zpdb configuration. This kind relies on the validating webhook to be in-place first, and if this validating webhook has just been added it can be a short wait until it has the CA bundle patched in allowing the endpoint to function. If the zpdb is applied immediately after the validating webhook is added it will fail. So Phase 2 would be rolling out the actual zpdb files which replace the existing pdb file.

Deployment tools libsonnet updates - https://github.com/grafana/deployment_tools/pull/323553

Mimir libsonnet updates - grafana/mimir#12315 (in progress)

@CLAassistant
Copy link

CLAassistant commented Jul 31, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add an integration test to cover this functionality?

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished reviewing the tests - will take a look tomorrow.

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest splitting out the changes related to updating the TLS certificate configuration to their own PR, so that this PR remains focused on the ZPDB implementation.

I still need to review the tests in pkg/admission/pod_eviction_test.go.

Comment on lines 26 to 40
statefulSetZoneA = "ingester-zone-a"
statefulSetZoneB = "ingester-zone-b"
statefulSetZoneC = "ingester-zone-c"
testPodZoneA0 = "ingester-zone-a-0"
testPodZoneA1 = "ingester-zone-a-1"
testPodZoneA2 = "ingester-zone-a-2"
testPodZoneB0 = "ingester-zone-b-0"
testPodZoneB1 = "ingester-zone-b-1"
testPodZoneB2 = "ingester-zone-b-2"
testPodZoneC0 = "ingester-zone-c-0"
testPodZoneC1 = "ingester-zone-c-1"
testPodZoneC2 = "ingester-zone-c-2"
testNamespace = "test-dev-0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] I'm not sure these constants add much - I'd be inclined to inline them.

Includes;

* new web-hook handler for pod eviction added to rollout-operator
* manifests for new custom resource definition for configuring the PDBZ
* updated development example files and environment
* rename to ZoneAwarePodDisruptionBudget
* updated root README
* updates to main eviction handler per PR
*
* updates per PR review
* added ZPDB informer to monitor ZPDB configurations
* added Pod informer to work with pod eviction cache
* added pod eviction cache
* added validating webhook configuration for ZPDB files
* added locking for concurrent eviction handling
Modify the rollout-operator to use an informer to monitor for new webhook configurations.

Webhook configurations need to be patched with a self signed CA, and previously this was only checked and applied on rollout-operator startup.

This change allows for new webhooks to be created once the rollout-operator is running and have the CA patch applied.
Modify the rollout-operator to use an informer to monitor for new webhook configurations.

Webhook configurations need to be patched with a self signed CA, and previously this was only checked and applied on rollout-operator startup.

This change allows for new webhooks to be created once the rollout-operator is running and have the CA patch applied.
@tcp13equals2 tcp13equals2 marked this pull request as ready for review August 14, 2025 06:57
Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you might have missed my earlier comments on pkg/admission/pod_eviction_test.go.

Comment on lines 60 to 65
func (c *config) min(a int32, b int32) int32 {
if a < b {
return a
}
return b
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go has a built-in min function that we should be able to use instead of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the math.min() which applies to float64 ... is the preference to cast to a float and then back to an int32 or is there another function?

Copy link
Contributor

@aknuds1 aknuds1 Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tcp13equals2 Charles is referring to the built-in min function, which is generic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha - nice. Thanks for the tip!

Copy link
Collaborator

@andyasp andyasp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did my best to read through this to see if anything jumped out at me. Only found a couple small things. I still don't feel like I have a great mental model of this yet, but this has clearly already been through a lot of review by Charles.

@tcp13equals2 tcp13equals2 merged commit e9aa5cc into main Aug 22, 2025
11 checks passed
@tcp13equals2 tcp13equals2 deleted the andrewhall-194-support-per-zone-pdb branch August 22, 2025 04:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants