[feature] Add ability to specify node affinity & toleration using KFP V2 #9682

@AlexandreBrown

Description

Feature Area

/area sdk

What feature would you like to see?

A core production use case of KFP is running CPU and GPU workloads on dedicated nodegroups that are more powerful than, and separate from, the nodegroup where Kubeflow itself is installed; these nodegroups usually have autoscaling enabled as well.
To achieve this, we used to be able to simply specify which node each component would run on using node affinity + tolerations. This is no longer possible in KFP v2, yet such a core feature should still be supported.
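For reference, here is a minimal sketch of the v1 pattern this request asks to restore. It assumes the kfp v1 SDK and the kubernetes Python client; the nodegroup label, taint values, and image are placeholders:

```python
from kubernetes.client import (
    V1Affinity,
    V1NodeAffinity,
    V1NodeSelector,
    V1NodeSelectorRequirement,
    V1NodeSelectorTerm,
    V1Toleration,
)
import kfp.dsl as dsl

@dsl.pipeline(name='train-on-dedicated-nodegroup')
def pipeline():
    train_op = dsl.ContainerOp(name='train', image='my-training-image:latest')
    # Require nodes from the dedicated nodegroup (label key/value are placeholders).
    train_op.add_affinity(V1Affinity(
        node_affinity=V1NodeAffinity(
            required_during_scheduling_ignored_during_execution=V1NodeSelector(
                node_selector_terms=[V1NodeSelectorTerm(
                    match_expressions=[V1NodeSelectorRequirement(
                        key='nodegroup',
                        operator='In',
                        values=['gpu-training'],
                    )],
                )],
            ),
        ),
    ))
    # Tolerate the taint that keeps unrelated pods off that nodegroup.
    train_op.add_toleration(V1Toleration(
        key='dedicated',
        operator='Equal',
        value='gpu-training',
        effect='NoSchedule',
    ))
```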

What is the use case or pain point?

The existing set_accelerator_type is far from flexible enough to allow this use case. Here are a few examples showing that set_accelerator_type cannot support production use cases (see the sketch after this list):

  • It does not work if the GPU is not one of the few (3) supported accelerators: NVIDIA_TESLA_K80, TPU_V3, or cloud-tpus.google.com/v3. Otherwise we must fall back to the generic nvidia.com/gpu, which is not precise and therefore defeats the purpose of selecting an accelerator.
  • If you have 2 nodegroups with the same GPU, but one should be reserved for inference and the other for pipeline execution (e.g. training), there is no way to express that distinction purely with set_accelerator_type('nvidia.com/gpu').
  • The method is only meant for GPUs, but it is common to want to run CPU workloads on specific nodegroups too; reasons include nodegroup isolation (running workloads that won't affect the nodegroup where Kubeflow's core pods run) or using more powerful CPU nodegroups for pipelines while Kubeflow itself stays on cheaper instances.
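
For contrast, a minimal sketch of what the v2 SDK offers today (the component body is a stub, and set_accelerator_limit is assumed alongside the set_accelerator_type call discussed above). A training nodegroup and an inference nodegroup that both expose nvidia.com/gpu match this request equally; nothing pins the task to either one:

```python
from kfp import dsl

@dsl.component
def train():
    print('training...')

@dsl.pipeline(name='v2-accelerator-only')
def pipeline():
    task = train()
    # The generic resource name is the only option for unsupported GPUs;
    # it cannot distinguish two nodegroups exposing the same resource.
    task.set_accelerator_type('nvidia.com/gpu')
    task.set_accelerator_limit(1)
```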

Is there a workaround currently?

Users can try external tools such as Kyverno to create mutating rules that a webhook applies in order to add a toleration and/or node affinity/node selector based on predefined criteria such as a label name and value.
This is still a pain, since it is far more involved than calling .add_node_affinity() and .add_toleration() on a component. In fact, we can't even add a pod label using the KFP SDK anymore, so matching has to be done on labels that happen to be present (we have no way to explicitly ensure their presence).
Also, even with Kyverno, some cases are hard or impossible to cover. For instance, suppose you have 2 Kubeflow components whose pods carry the same labels, but you want one to run on a less expensive GPU nodegroup and the other on a more powerful one. Since the pods have identical labels, the only way to specify which nodegroup each should run on is at component definition time (via the KFP SDK), which is currently not supported in KFP v2.
Given that Kubeflow's main goal is to lower the barrier to running ML on Kubernetes, this workaround goes against that goal and should not be the only solution available. It would be in everyone's best interest if the KFP SDK added back add_node_affinity() and add_toleration() so that data scientists/ML specialists can easily specify where each component runs, instead of relying on more advanced MLOps tooling that demands ever more Kubernetes knowledge.
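
To make the request concrete, here is a hypothetical sketch of the API shape this issue asks for. Neither method exists in the v2 SDK today, and the parameter names are illustrative only, loosely mirroring the v1 methods:

```python
from kfp import dsl

@dsl.component
def train():
    print('training...')

@dsl.pipeline(name='proposed-scheduling-api')
def pipeline():
    task = train()
    # Hypothetical API: neither call exists in KFP v2 today; this is the
    # requested interface, not a working example.
    task.add_node_affinity(label_key='nodegroup', label_values=['gpu-a100'])
    task.add_toleration(key='dedicated', operator='Equal',
                        value='gpu-a100', effect='NoSchedule')
```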

Love this idea? Give it a 👍.
