Standalone Katib deployment: creating a new experiment fails due to the MutatingWebhook timing out #1258

@kylepad

/kind bug

Hi, I am trying to set up a standalone deployment of Katib v1beta1 on GKE. It is very possible that I am doing something totally wrong here or missing something obvious, which is why I'm coming to you for help.

TL;DR: I have tried deploying Katib via the deploy.sh script and also via Terraform. Whichever way I deploy, everything seems fine: the katib-controller, katib-db-manager, katib-mysql and katib-ui pods are all up and running, and the logs look clean. Then I try to submit one of the sample experiments and I get this timeout error:

$ kubectl apply -f examples/v1beta1/grid-example.yaml 
    Error from server (InternalError): error when creating "examples/v1beta1/grid-example.yaml": 
    Internal error occurred: failed calling webhook "mutating.experiment.katib.kubeflow.org": 
    Post https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s: context deadline exceeded
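Since the error names the service URL the API server calls, one early check is whether that Service has ready endpoints and which pod port it forwards to. The following is a minimal sketch, not a confirmed fix: the function names `webhook_url` and `check_webhook_backend` are hypothetical, and the service name, namespace and path are copied from the error message above.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; the service name, namespace
# and path come from the error message above.

# Rebuild the in-cluster URL the API server calls for a webhook service.
webhook_url() {
  # $1 = service name, $2 = namespace, $3 = webhook path
  printf 'https://%s.%s.svc:443%s\n' "$1" "$2" "$3"
}

# Show the Service port mapping and the pod IPs behind it.
check_webhook_backend() {
  # $1 = service name, $2 = namespace
  echo "Webhook URL: $(webhook_url "$1" "$2" /mutate-experiments)"
  # Print each "port -> targetPort" pair of the Service.
  kubectl -n "$2" get service "$1" \
    -o jsonpath='{range .spec.ports[*]}{.port}{" -> "}{.targetPort}{"\n"}{end}'
  # An empty ENDPOINTS column would mean no ready pod backs the Service.
  kubectl -n "$2" get endpoints "$1" -o wide
}
```

Running `check_webhook_backend katib-controller kubeflow` against the cluster prints the URL, the port mapping and the endpoints. If the endpoints list is populated, the Service itself is probably not the problem and the network path to the pod is the next thing to look at.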

What I've tried:

I tried following all the debugging steps in #1160 (closest to this issue AFAIK) and didn't really get anywhere.
The webhook itself seems to be set up (admittedly, I can't say whether it's correct or not):

$ kubectl describe MutatingWebhookConfiguration katib-mutating-webhook-config
Name:         katib-mutating-webhook-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  admissionregistration.k8s.io/v1beta1
Kind:         MutatingWebhookConfiguration
Metadata:
  Creation Timestamp:  2020-07-09T00:24:58Z
  Generation:          1
  Resource Version:    17496
  Self Link:           /apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations/katib-mutating-webhook-config
  UID:                 9eaca06d-c17a-11ea-ac7b-42010a000066
Webhooks:
  Admission Review Versions:
    v1beta1
  Client Config:
    Ca Bundle:  <hidden>
    Service:
      Name:        katib-controller
      Namespace:   kubeflow
      Path:        /mutate-experiments
  Failure Policy:  Fail
  Name:            mutating.experiment.katib.kubeflow.org
  Namespace Selector:
    Match Expressions:
      Key:       control-plane
      Operator:  DoesNotExist
  Rules:
    API Groups:
      kubeflow.org
    API Versions:
      v1beta1
    Operations:
      CREATE
      UPDATE
    Resources:
      experiments
    Scope:          *
  Side Effects:     Unknown
  Timeout Seconds:  30
  Admission Review Versions:
    v1beta1
  Client Config:
    Ca Bundle: <hidden>
    Service:
      Name:        katib-controller
      Namespace:   kubeflow
      Path:        /mutate-pods
  Failure Policy:  Ignore
  Name:            mutating.pod.katib.kubeflow.org
  Namespace Selector:
    Match Labels:
      Katib - Metricscollector - Injection:  enabled
  Rules:
    API Groups:
      
    API Versions:
      v1
    Operations:
      CREATE
    Resources:
      pods
    Scope:          *
  Side Effects:     Unknown
  Timeout Seconds:  30
Events:             <none>
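
With the configuration above pointing at `katib-controller` in `kubeflow`, a further step could be to call the webhook path from a throwaway pod inside the cluster. This is a hedged sketch: the function names, the pod name `webhook-probe` and the `curlimages/curl` image are assumptions, and it relies on `kubectl run` with `--restart=Never` returning the container's exit code. A success here only shows pod-to-pod reachability; the API server may still be unable to reach the pod from the control plane.

```shell
#!/bin/sh
# Sketch only: helper names, pod name and image are assumptions.

# Map a curl exit code to a rough diagnosis.
classify_curl_exit() {
  case "$1" in
    0)  echo "reachable: the controller answered over HTTPS" ;;
    6)  echo "DNS failure: the service name did not resolve" ;;
    7)  echo "connection failed: nothing accepted the connection" ;;
    28) echo "timed out: packets are probably being dropped" ;;
    *)  echo "other failure (exit code $1)" ;;
  esac
}

# Run curl in a temporary pod and classify the result.
probe_webhook() {
  # $1 = URL to probe, e.g. the one from the error message
  # -k skips certificate verification; --max-time bounds the wait.
  kubectl -n kubeflow run webhook-probe --rm -i --restart=Never \
    --image=curlimages/curl -- \
    curl -sk -o /dev/null --max-time 10 "$1"
  classify_curl_exit "$?"
}
```

For example, `probe_webhook https://katib-controller.kubeflow.svc:443/mutate-experiments`. If this reports "reachable" while `kubectl apply` still times out, that would suggest the controller is serving and the block sits between the control plane and the pod network.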

I've tried multiple Kubernetes versions, from 1.14.10-gke.45 to 1.16.9-gke.6.

Some things I have not tried:

  • installing via the GCP kubeflow script (the whole point is to get a standalone katib deployment)
  • installing v1alpha3 (not against it, but ideally we want the newest version)

Any and all help is appreciated! 😄
