@jcpowermac
Contributor

This commit fixes a critical issue where the machine-api-operator was creating and destroying vCenter REST API sessions on every machine reconciliation, causing excessive login/logout cycles that pollute vCenter audit logs and create unnecessary session churn.

Root Cause:
The WithRestClient() and WithCachingTagsManager() wrapper functions were creating new REST sessions, performing operations, and immediately logging out on every invocation. With hundreds of machines reconciling periodically, this created a constant stream of login/logout events.
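
To make the scale of the problem concrete, here is a minimal sketch of the per-call wrapper shape described above. The types `fakeVCenter` and `restClient`, and the helpers `withRestClient` and `reconcileAll`, are hypothetical stand-ins written for illustration; they are not the operator's code or govmomi APIs. The only point is that authentication events grow with the number of reconciliations.

```go
package main

import "fmt"

// fakeVCenter is a hypothetical stand-in for vCenter that only counts
// authentication events. It is not part of govmomi or the operator.
type fakeVCenter struct {
	logins  int
	logouts int
}

// restClient is a hypothetical stand-in for an authenticated REST client.
type restClient struct {
	vc *fakeVCenter
}

func (vc *fakeVCenter) login() *restClient {
	vc.logins++
	return &restClient{vc: vc}
}

func (c *restClient) logout() {
	c.vc.logouts++
}

// withRestClient mimics the wrapper shape from the root cause: log in,
// run the callback, and log out again on every invocation.
func withRestClient(vc *fakeVCenter, f func(c *restClient) error) error {
	c := vc.login()
	defer c.logout()
	return f(c)
}

// reconcileAll simulates `rounds` reconciliation passes over `machines`
// machines, each performing one tag operation through the wrapper.
func reconcileAll(vc *fakeVCenter, machines, rounds int) error {
	for r := 0; r < rounds; r++ {
		for m := 0; m < machines; m++ {
			if err := withRestClient(vc, func(*restClient) error { return nil }); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	vc := &fakeVCenter{}
	if err := reconcileAll(vc, 200, 3); err != nil {
		fmt.Println("error:", err)
		return
	}
	// 200 machines x 3 passes yields 600 logins and 600 logouts.
	fmt.Printf("logins=%d logouts=%d\n", vc.logins, vc.logouts)
}
```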

Solution (inspired by cluster-api-provider-vsphere):

  • Add TagManager field to Session struct to cache REST client
  • Initialize and cache REST client during session creation (GetOrCreate)
  • Validate both SOAP and REST session health before reusing cached sessions
  • Add GetCachingTagsManager() helper for direct access to cached tag manager
  • Update reconcileRegionAndZoneLabels() to use cached tag manager
  • Update reconcileTags() to use cached tag manager
  • Deprecate WithRestClient() and WithCachingTagsManager(), retaining them for backward compatibility
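
The caching and dual-validation steps above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in session.go: `backend`, `session`, `sessionCache`, and `getOrCreate` are hypothetical names, and the `soapActive`/`restActive` booleans stand in for real session health checks against vCenter.

```go
package main

import (
	"fmt"
	"sync"
)

// backend is a hypothetical stand-in for vCenter that counts how many
// times a full session (SOAP plus REST) had to be established.
type backend struct {
	sessionsCreated int
}

// tagManager is a placeholder for a tag manager bound to a REST session.
type tagManager struct {
	owner string
}

// session loosely models the described Session struct: a SOAP side, a
// REST side, and the cached tag manager. The boolean flags are an
// assumption standing in for real health checks.
type session struct {
	soapActive bool
	restActive bool
	tagManager *tagManager
}

// healthy reports whether both halves of the session can be reused.
func (s *session) healthy() bool {
	return s.soapActive && s.restActive && s.tagManager != nil
}

// sessionCache keeps one session per key (for example server plus user).
type sessionCache struct {
	mu       sync.Mutex
	sessions map[string]*session
	vc       *backend
}

// getOrCreate returns the cached session when both SOAP and REST are
// still valid, and otherwise creates and caches a fresh one.
func (c *sessionCache) getOrCreate(key string) *session {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.sessions[key]; ok && s.healthy() {
		return s
	}
	// No entry, or one half has expired: authenticate again and replace it.
	c.vc.sessionsCreated++
	s := &session{soapActive: true, restActive: true, tagManager: &tagManager{owner: key}}
	c.sessions[key] = s
	return s
}

func main() {
	vc := &backend{}
	cache := &sessionCache{sessions: map[string]*session{}, vc: vc}
	key := "vcenter.example.com/user" // hypothetical cache key

	for i := 0; i < 600; i++ {
		cache.getOrCreate(key)
	}
	fmt.Println("sessions created after 600 calls:", vc.sessionsCreated)

	// Simulate the REST session expiring while SOAP is still valid.
	cache.getOrCreate(key).restActive = false
	cache.getOrCreate(key)
	fmt.Println("sessions created after REST expiry:", vc.sessionsCreated)
}
```

Checking both halves before reuse matters because a session whose SOAP side is still valid could otherwise be handed out with a dead REST side, and every tag operation on it would then fail until the cache entry was replaced.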

Key Changes:

  1. pkg/controller/vsphere/session/session.go:

    • Added TagManager *tags.Manager field to Session struct
    • Modified GetOrCreate() to create and cache REST client once
    • Added dual session validation (SOAP + REST) before reusing sessions
    • Added GetCachingTagsManager() method for direct access
    • Deprecated old wrapper functions with migration guidance
  2. pkg/controller/vsphere/reconciler.go:

    • Updated reconcileRegionAndZoneLabels() to use GetCachingTagsManager()
    • Updated reconcileTags() to use GetCachingTagsManager()
    • Eliminated callback pattern in favor of direct access
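
As an illustration of the direct-access shape in the reconciler, here is a hedged sketch. Only the helper name mirrors GetCachingTagsManager() from the description; the signature (returning an error when the manager is missing), the `tagManager` and `session` types, and `listAttachedTags` are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// tagManager is a hypothetical stand-in for a tag manager bound to a
// long-lived REST session; attached maps a VM name to its tag names.
type tagManager struct {
	attached map[string][]string
}

func (m *tagManager) listAttachedTags(vm string) []string {
	return m.attached[vm]
}

// session is a simplified stand-in for the cached vSphere session.
type session struct {
	tagManager *tagManager
}

// getCachingTagsManager mirrors the helper named in the description.
// Returning an error for an uninitialized manager is an assumption.
func (s *session) getCachingTagsManager() (*tagManager, error) {
	if s == nil || s.tagManager == nil {
		return nil, errors.New("tag manager not initialized")
	}
	return s.tagManager, nil
}

// reconcileTags shows the direct-access shape: fetch the cached manager,
// use it, and return. No login or logout happens on this path.
func reconcileTags(s *session, vm string) ([]string, error) {
	m, err := s.getCachingTagsManager()
	if err != nil {
		return nil, fmt.Errorf("reconciling tags for %s: %w", vm, err)
	}
	return m.listAttachedTags(vm), nil
}

func main() {
	s := &session{tagManager: &tagManager{
		attached: map[string][]string{"worker-0": {"cluster-id"}},
	}}
	tags, err := reconcileTags(s, "worker-0")
	fmt.Println(tags, err)

	// A session without a tag manager surfaces an error instead of panicking.
	_, err = reconcileTags(&session{}, "worker-0")
	fmt.Println(err)
}
```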

Impact:

  • Eliminates excessive vCenter login/logout cycles
  • Reduces vCenter session churn from O(reconciliations) to O(1) per MAPI instance
  • Improves performance by removing authentication overhead on every tag operation
  • REST session now lives as long as SOAP session (until invalidation)

Reference Implementation:
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/pkg/session/session.go

Backward Compatibility:
The deprecated wrapper functions are maintained with warning logs to support existing test code. All production code paths now use the new pattern.

Fixes: Excessive vCenter logout events reported by customers

@openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Nov 13, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chrischdi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jcpowermac
Contributor Author

/test ?

@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

@jcpowermac: The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-virtualmedia
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-crds-sync
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-aws-operator-techpreview
/test e2e-azure-manual-oidc
/test e2e-azure-operator
/test e2e-azure-operator-techpreview
/test e2e-azure-ovn
/test e2e-gcp-operator
/test e2e-gcp-ovn
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-upgrade
/test e2e-nutanix
/test e2e-nutanix-operator-multi-subnet
/test e2e-openstack
/test e2e-vsphere-host-groups-ovn-techpreview
/test e2e-vsphere-operator
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-multi-vcenter
/test e2e-vsphere-ovn-serial
/test e2e-vsphere-ovn-techpreview
/test e2e-vsphere-ovn-techpreview-serial
/test e2e-vsphere-ovn-upgrade
/test e2e-vsphere-static-ovn
/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-operator-main-e2e-aws-ovn
pull-ci-openshift-machine-api-operator-main-goimports
pull-ci-openshift-machine-api-operator-main-golint
pull-ci-openshift-machine-api-operator-main-govet
pull-ci-openshift-machine-api-operator-main-images
pull-ci-openshift-machine-api-operator-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-operator-main-okd-scos-images
pull-ci-openshift-machine-api-operator-main-unit
pull-ci-openshift-machine-api-operator-main-verify-crds-sync
pull-ci-openshift-machine-api-operator-main-verify-deps

@jcpowermac
Contributor Author

/test e2e-vsphere-ovn-serial

@jcpowermac
Contributor Author

/test e2e-vsphere-ovn

@jcpowermac
Contributor Author

Testing with #1446 confirmed this is a problem, so I'm moving forward with testing this fix.

/test e2e-vsphere-ovn

@jcpowermac
Contributor Author

/test golint

@jcpowermac force-pushed the fix-vsphere-rest-session-caching branch from 2caad48 to 3b31029 on December 10, 2025 15:22
@jcpowermac changed the title from "vsphere: Cache REST API sessions to prevent excessive vCenter logouts" to "OCPBUGS-64937: vsphere - Cache REST API sessions to prevent excessive vCenter logouts" on Dec 10, 2025
@openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels on Dec 10, 2025
@openshift-ci-robot
Contributor

@jcpowermac: This pull request references Jira Issue OCPBUGS-64937, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@jcpowermac
Contributor Author

/test all

@jcpowermac
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/severity-moderate (Referenced Jira bug's severity is moderate for the branch this PR is targeting.) and jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) labels and removed the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) label on Dec 10, 2025
@openshift-ci-robot
Contributor

@jcpowermac: This pull request references Jira Issue OCPBUGS-64937, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.


This commit fixes a critical issue where the machine-api-operator was
creating and destroying vCenter REST API sessions on every machine
reconciliation, causing excessive login/logout cycles that pollute
vCenter audit logs and create unnecessary session churn.

Root Cause:
The WithRestClient() and WithCachingTagsManager() wrapper functions
were creating new REST sessions, performing operations, and immediately
logging out on every invocation. With hundreds of machines reconciling
periodically, this created a constant stream of login/logout events.

Solution (inspired by cluster-api-provider-vsphere):
- Add TagManager field to Session struct to cache REST client
- Initialize and cache REST client during session creation (GetOrCreate)
- Validate both SOAP and REST session health before reusing cached sessions
- Add GetCachingTagsManager() helper for direct access to cached tag manager
- Update reconcileRegionAndZoneLabels() to use cached tag manager
- Update reconcileTags() to use cached tag manager
- Deprecate WithRestClient() and WithCachingTagsManager() for backward compatibility

Key Changes:
1. pkg/controller/vsphere/session/session.go:
   - Added TagManager *tags.Manager field to Session struct
   - Modified GetOrCreate() to create and cache REST client once
   - Added dual session validation (SOAP + REST) before reusing sessions
   - Added GetCachingTagsManager() method for direct access
   - Deprecated old wrapper functions with migration guidance

2. pkg/controller/vsphere/reconciler.go:
   - Updated reconcileRegionAndZoneLabels() to use GetCachingTagsManager()
   - Updated reconcileTags() to use GetCachingTagsManager()
   - Eliminated callback pattern in favor of direct access

Impact:
- Eliminates excessive vCenter login/logout cycles
- Reduces vCenter session churn from O(reconciliations) to O(1) per MAPI instance
- Improves performance by removing authentication overhead on every tag operation
- REST session now lives as long as SOAP session (until invalidation)

Reference Implementation:
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/pkg/session/session.go

Backward Compatibility:
The deprecated wrapper functions are maintained with warning logs to support
existing test code. All production code paths now use the new pattern.

Fixes: Excessive vCenter logout events reported by customers
Signed-off-by: Claude Code Assistant <[email protected]>
Co-Authored-By: Claude <[email protected]>
@jcpowermac force-pushed the fix-vsphere-rest-session-caching branch from 3b31029 to 114d062 on December 11, 2025 19:57
@jcpowermac marked this pull request as ready for review on December 11, 2025 19:57
@openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Dec 11, 2025
@jcpowermac
Contributor Author

/test e2e-vsphere-ovn-techpreview-serial
/test e2e-vsphere-ovn

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@jcpowermac: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-vsphere-ovn-serial | 2caad48 | link | false | /test e2e-vsphere-ovn-serial
ci/prow/e2e-vsphere-ovn-techpreview-serial | 114d062 | link | false | /test e2e-vsphere-ovn-techpreview-serial

Full PR test history. Your PR dashboard.


@rvanderp3
Contributor

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Dec 15, 2025
@jcpowermac
Contributor Author

/assign @damdo

@jcpowermac
Contributor Author

/verified by @jcpowermac via local install and ci

@openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) on Dec 15, 2025
@openshift-ci-robot
Contributor

@jcpowermac: This PR has been marked as verified by @jcpowermac via local install and ci.


Labels

  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

4 participants