Skip to content

feat: operator watch namespaces #6434

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

RobertSamoilescu
Copy link
Contributor

@RobertSamoilescu RobertSamoilescu commented Apr 30, 2025

What this PR does / why we need it:

This PR implements the option to configure the operator to watch specific namespaces.

To install the operator to watch specific namespaces, run:

helm upgrade seldon-core-v2-setup ./seldon-core-v2-setup/   --namespace seldon-mesh   --set "controller.watchNamespaces={seldon-mesh,seldon-mesh-2}"   --set controller.clusterwide=false   --install

In the example above, the operator will watch only seldon-mesh and seldon-mesh-2 namespaces.

Configuration details:

  • if clusterwide flag is set to true and watchNamespace is not set, then all the namespaces will be watched
  • if clusterwide flag is set to true and watchNamespaces is set, then the operator will watch the namespaces specified
  • if clusterwide flag is set to false, then the operator will watch the POD_NAMESPACE which is the same as the release namespace.

Additionally, if a user wants to have a controller in ns1 watching the seldon-mesh-1, and another controller in ns2 watching seldon-mesh-2, they can use --set controller.skipClusterRoleCreation to overcome the error from recreating the same cluster role twice.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@RobertSamoilescu RobertSamoilescu requested a review from lc525 as a code owner April 30, 2025 11:09
Copy link
Member

@lc525 lc525 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an additional situation we will want to cover, but we might choose to reprioritise and deal with it in a separate PR:

  • clusterwide: True , watchNamespaces not set -- watch all namespaces
  • clusterwide: True, watchNamespaces is set and we allow the operator ClusterRole/ClusterRoleBinding permissions. Here, the operator only watches some namespaces but has the permissions to watch any namespace. In security-restricted environments, this is problematic. However, other operators like postgres or spark also only offer this option.
  • clusterwide: False, watchNamespaces not set -- only watch the namespace where the operator is installed
  • clusterwide: False, watchNamespaces is set -- we only give permissions to the operator to watch the specific list of namespaces. The helm chart needs to create RoleBindings in each namespace in the list (or which matches a specific criteria) during the call to helm install or helm upgrade. These RoleBindings should give the operator’s service account the necessary privileges in the namespace.

@lc525
Copy link
Member

lc525 commented May 1, 2025

Before we merge this, let's test for sane behaviour when instantiating two operators in ns1 and ns2, both watching the same namespace seldon-mesh. We only want one of them to do the actual management / connect to the scheduler and issue control-plane commands.

@RobertSamoilescu
Copy link
Contributor Author

Before we merge this, let's test for sane behaviour when instantiating two operators in ns1 and ns2, both watching the same namespace seldon-mesh. We only want one of them to do the actual management / connect to the scheduler and issue control-plane commands.

In this case, although the scheduler seems to behave ok for model loading/unloading, there are some errors which pop up in the operator.

When loading:

2025-05-01T14:10:13.851Z    ERROR    Reconciler error    {"controller": "model", "controllerGroup": "mlops.seldon.io", "controllerKind": "Model", "Model": {"name":"iris","namespac │
│ sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler                                                                                          │
│     sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:341                                                                                                │
│ sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem                                                                                       │
│     sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:288                                                                                                │
│ sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2                                                                                             │
│     sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:249  

When unloading:

2025-05-01T14:11:15.778Z    ERROR    schedulerClient.SubscribeModelEvents    Failed to update status    {"model": "iris", "error": "Model.mlops.seldon.io \"iris\" not found"}      │
│ github.com/seldonio/seldon-core/operator/v2/scheduler.(*SchedulerClient).SubscribeModelEvents                                                                                       │
│     github.com/seldonio/seldon-core/operator/v2/scheduler/model.go:270                                                                                                              │
│ github.com/seldonio/seldon-core/operator/v2/scheduler.retryFn.func2                                                                                                                 │
│     github.com/seldonio/seldon-core/operator/v2/scheduler/client.go:327                                                                                                             │
│ github.com/cenkalti/backoff/v4.RetryNotifyWithTimer.Operation.withEmptyData.func1                                                                                                   │
│     github.com/cenkalti/backoff/[email protected]/retry.go:18                                                                                                                               │
│ github.com/cenkalti/backoff/v4.doRetryNotify[...]                                                                                                                                   │
│     github.com/cenkalti/backoff/[email protected]/retry.go:88                                                                                                                               │
│ github.com/cenkalti/backoff/v4.RetryNotifyWithTimer                                                                                                                                 │
│     github.com/cenkalti/backoff/[email protected]/retry.go:61                                                                                                                               │
│ github.com/cenkalti/backoff/v4.RetryNotify                                                                                                                                          │
│     github.com/cenkalti/backoff/[email protected]/retry.go:49                                                                                                                               │
│ github.com/seldonio/seldon-core/operator/v2/scheduler.retryFn                                                                                                                       │
│     github.com/seldonio/seldon-core/operator/v2/scheduler/client.go:329                                                                                                             │
│ github.com/seldonio/seldon-core/operator/v2/scheduler.(*SchedulerClient).startEventHanders.func1                                                                                    │
│     github.com/seldonio/seldon-core/operator/v2/scheduler/client.go:95    

This is a faulty configuration anyways, and maybe we should address it in a future PR.

@@ -183,7 +183,7 @@ rules:
- list
- watch
---
{{- if .Values.controller.clusterwide -}}
{{- if and (not .Values.controller.skipClusterRoleCreation) (or .Values.controller.clusterwide .Values.controller.watchNamespaces) -}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the cluster role specific to the operator. Perhaps we should reflect that in the value name? Something like skipOperatorClusterRoleCreation

Copy link
Member

@lc525 lc525 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just a small suggestion on the helm value name

@RobertSamoilescu RobertSamoilescu merged commit 89c7ab7 into SeldonIO:v2 May 2, 2025
3 checks passed
jtayl222 pushed a commit to jtayl222/seldon-core that referenced this pull request Jul 20, 2025
* Modified helm-chart to watch for specific namespaces

* Included watch-namespaces flag and validation of the flags

* Removed extra indent

* Fixed autogenerated charts

* Added new line at the end of patch_controller.yaml

* Refactored operator code so watchNamespaces if used in conjuction with clusterwide

* Ensure the current namespace is included in the cache config and imporved parsing

* Implemented the option to skip the creation of ClusterRole

* Replaced Release.Name with Release.Namespace

* Fixed rolebinding config

* Remove namespace from the name of rolebind since it is not required

* Improved flag description

* Renamed skipClusterRoleCreation to skipOperatorClusterRoleCreation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants