Grid Search stuck when parallelTrialCount < maxTrialCount #1534

@sidpalas

/kind bug

What steps did you take and what happened:

I created an experiment to perform a grid search, set maxTrialCount equal to the size of the grid, and set parallelTrialCount < maxTrialCount.

The first set of parallel trials completes successfully, but the experiment then gets stuck. The suggestion shows more requests than assigned suggestions (Requests: 13 vs. Suggestion Count: 8 in the output below) and reports the following warning:
Warning ReconcileError 4s (x12 over 31s) suggestion-controller The response contains unexpected trials

What did you expect to happen:
Experiment to execute trials for each point in the parameter space and then complete.

Anything else you would like to add:

I am running this image for the katib-controller: gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:latest (I also tested with gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:v0.8.0)

This issue is likely related to #1494.

If I set parallelTrialCount >= maxTrialCount (so that all trials can start immediately), the experiment succeeds.

Environment:

  • Kubeflow version (kfctl version): v1.0.1
  • Minikube version (minikube version): N/A
  • Kubernetes version: (use kubectl version): v1.18.17-gke.100
  • OS (e.g. from /etc/os-release):

experiment.yaml:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: sid
  labels:
    controller-tools.k8s.io: "1.0"
  name: minimal-grid
spec:
  objective:
    type: maximize
    goal: 999999
    objectiveMetricName: value
  resumePolicy: Never
  algorithm:
    algorithmName: grid
  parallelTrialCount: 5
  maxTrialCount: 15
  maxFailedTrialCount: 3
  parameters:
    - name: --param_1
      parameterType: int
      feasibleSpace:
        min: "0"
        max: "15"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: <IMAGE>
                command:
                - "python3"
                - "return_param_value.py"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
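
As a sanity check on the numbers in this experiment, here is a minimal Python sketch that counts the grid points implied by the feasibleSpace above and splits the trials into batches of parallelTrialCount. It assumes a step of 1 for the int parameter. Whether Katib treats the upper bound as inclusive is not confirmed here, so both cases are shown. The helper names `grid_points` and `batches` are hypothetical and are not part of Katib.

```python
from typing import List


def grid_points(min_value: int, max_value: int, inclusive: bool = True) -> List[int]:
    """Enumerate an int parameter with step 1 (an assumption, not Katib's code)."""
    stop = max_value + 1 if inclusive else max_value
    return list(range(min_value, stop))


def batches(points: List[int], size: int) -> List[List[int]]:
    """Split the points into consecutive groups of at most `size` items."""
    return [points[i:i + size] for i in range(0, len(points), size)]


if __name__ == "__main__":
    max_trial_count = 15
    parallel_trial_count = 5
    for inclusive in (True, False):
        points = grid_points(0, 15, inclusive)
        print(f"inclusive={inclusive}: {len(points)} grid points")
        # Only the first maxTrialCount points would be run as trials.
        for batch in batches(points[:max_trial_count], parallel_trial_count):
            print("  batch:", batch)
```

With an inclusive upper bound the grid has 16 points, one more than maxTrialCount; with an exclusive bound it has exactly 15. In either case, three batches of five trials are expected when parallelTrialCount is 5.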

Result of kubectl describe suggestion <suggestion-name>:

Name:         minimal-grid
Namespace:    sid
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Suggestion
Metadata:
  Creation Timestamp:  2021-05-18T05:58:39Z
  Generation:          9
  Managed Fields:
    API Version:  kubeflow.org/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:controller-tools.k8s.io:
        f:ownerReferences:
      f:spec:
        .:
        f:algorithmName:
        f:requests:
      f:status:
        .:
        f:conditions:
        f:startTime:
        f:suggestionCount:
        f:suggestions:
    Manager:    katib-controller
    Operation:  Update
    Time:       2021-05-18T05:58:56Z
  Owner References:
    API Version:           kubeflow.org/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  minimal-grid
    UID:                   90e538b2-5858-45d2-af0e-38c0837d6a8d
  Resource Version:        69731263
  Self Link:               /apis/kubeflow.org/v1alpha3/namespaces/sid/suggestions/minimal-grid
  UID:                     3875d508-b3ee-4426-88a7-1b307c381249
Spec:
  Algorithm Name:  grid
  Requests:        13
Status:
  Conditions:
    Last Transition Time:  2021-05-18T05:58:39Z
    Last Update Time:      2021-05-18T05:58:39Z
    Message:               Suggestion is created
    Reason:                SuggestionCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-05-18T05:58:54Z
    Last Update Time:      2021-05-18T05:58:54Z
    Message:               Deployment is ready
    Reason:                DeploymentReady
    Status:                True
    Type:                  DeploymentReady
    Last Transition Time:  2021-05-18T05:58:55Z
    Last Update Time:      2021-05-18T05:58:55Z
    Message:               Suggestion is running
    Reason:                SuggestionRunning
    Status:                True
    Type:                  Running
  Start Time:              2021-05-18T05:58:39Z
  Suggestion Count:        8
  Suggestions:
    Name:  minimal-grid-dq75pd5j
    Parameter Assignments:
      Name:   --param_1
      Value:  0
    Name:     minimal-grid-jsl78k9b
    Parameter Assignments:
      Name:   --param_1
      Value:  1
    Name:     minimal-grid-nbhlss7f
    Parameter Assignments:
      Name:   --param_1
      Value:  2
    Name:     minimal-grid-p87q65dk
    Parameter Assignments:
      Name:   --param_1
      Value:  3
    Name:     minimal-grid-dt2s5jd5
    Parameter Assignments:
      Name:   --param_1
      Value:  4
    Name:     minimal-grid-jxtrdnk9
    Parameter Assignments:
      Name:   --param_1
      Value:  6
    Name:     minimal-grid-l2t2gpc8
    Parameter Assignments:
      Name:   --param_1
      Value:  9
    Name:     minimal-grid-4dpdlj2v
    Parameter Assignments:
      Name:   --param_1
      Value:  13
Events:
  Type     Reason          Age               From                   Message
  ----     ------          ----              ----                   -------
  Warning  ReconcileError  0s (x8 over 15s)  suggestion-controller  The response contains unexpected trials
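
To make the mismatch in the output above easier to see, here is a small Python sketch that summarizes it. The values are copied by hand from the describe output (Requests, and the --param_1 value of each suggestion in listed order). The `summarize` helper is hypothetical. It only describes the symptom and does not diagnose the cause.

```python
from typing import Dict, List

# Copied from the `kubectl describe suggestion` output above.
REQUESTED = 13                                # Spec.Requests
ASSIGNED_VALUES = [0, 1, 2, 3, 4, 6, 9, 13]   # Status.Suggestions, --param_1 values


def summarize(requested: int, assigned_values: List[int]) -> Dict[str, object]:
    """Compare requested vs. assigned suggestions and list skipped grid values."""
    assigned = len(assigned_values)
    # Differences between consecutive assigned values, in listed order.
    gaps = [b - a for a, b in zip(assigned_values, assigned_values[1:])]
    # Integer values between the smallest and largest assignment that never appear.
    full_range = set(range(min(assigned_values), max(assigned_values) + 1))
    skipped = sorted(full_range - set(assigned_values))
    return {
        "assigned": assigned,
        "outstanding": requested - assigned,
        "gaps": gaps,
        "skipped": skipped,
    }


if __name__ == "__main__":
    for key, value in summarize(REQUESTED, ASSIGNED_VALUES).items():
        print(f"{key}: {value}")
```

For the values above, this reports 8 assigned suggestions, 5 outstanding requests, and skipped values 5, 7, 8, 10, 11, and 12. The first five assignments are contiguous (0 through 4), and the gaps grow after that (2, 3, 4).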
