Early stopping implementation #1330

/kind feature

To continue the idea proposed here: https://github.com/kubeflow/katib/issues/692#issuecomment-626054238, we need to implement the following steps:

1. I propose these API changes in [Suggestion](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/suggestions/v1beta1/suggestion_types.go):
```go
// SuggestionSpec defines the desired state of suggestion.
type SuggestionSpec struct {
	// Name of the algorithm that suggestion is used.
	AlgorithmName string `json:"algorithmName"`
	. . .
	// Name of the early stopping algorithm
	EarlyStoppingAlgorithmName string `json:"earlyStoppingAlgorithmName,omitempty"`
	. . .
}

// TrialAssignment is the assignment for one trial.
type TrialAssignment struct {
	// Suggestion results
	ParameterAssignments []common.ParameterAssignment `json:"parameterAssignments,omitempty"`

	// Name of the suggestion
	Name string `json:"name,omitempty"`

	// Parameters for early stopping techniques
	// Contains parameter name, value and comparison type
	EarlyStoppingRules []common.EarlyStoppingRule `json:"earlyStoppingRules,omitempty"`
}
```
I propose these API changes in [Trial](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/trials/v1beta1/trial_types.go):
```go
type TrialSpec struct {
	. . .
	// Key-value pairs for hyperparameters and assignment values.
	ParameterAssignments []common.ParameterAssignment `json:"parameterAssignments"`

	// Parameters for early stopping techniques
	// Contains parameter name, value and comparison type
	EarlyStoppingRules []common.EarlyStoppingRule `json:"earlyStoppingRules,omitempty"`
	. . .
}
```
I propose these API changes in [Common](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/common/v1beta1/common_types.go):
```go
type EarlyStoppingRule struct {
	Name       string         `json:"name,omitempty"`
	Value      string         `json:"value,omitempty"`
	Comparison ComparisonType `json:"comparison,omitempty"`
}

type ComparisonType string

const (
	Equal   ComparisonType = "equal"
	Less    ComparisonType = "less"
	Greater ComparisonType = "greater"
)
```
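To illustrate how a rule of this shape could be evaluated, here is a minimal sketch. The `ruleMet` helper is hypothetical and not part of Katib; it assumes that `Value` always holds a number, which the proposal does not state explicitly. The JSON tags are omitted to keep the example short.

```go
package main

import (
	"fmt"
	"strconv"
)

// ComparisonType and EarlyStoppingRule mirror the types proposed above.
type ComparisonType string

const (
	Equal   ComparisonType = "equal"
	Less    ComparisonType = "less"
	Greater ComparisonType = "greater"
)

type EarlyStoppingRule struct {
	Name       string
	Value      string
	Comparison ComparisonType
}

// ruleMet is a hypothetical helper: it reports whether the observed metric
// satisfies the rule. It assumes rule.Value is numeric.
func ruleMet(rule EarlyStoppingRule, observed float64) (bool, error) {
	threshold, err := strconv.ParseFloat(rule.Value, 64)
	if err != nil {
		return false, fmt.Errorf("rule %q: value %q is not numeric: %w", rule.Name, rule.Value, err)
	}
	switch rule.Comparison {
	case Equal:
		return observed == threshold, nil
	case Less:
		return observed < threshold, nil
	case Greater:
		return observed > threshold, nil
	default:
		return false, fmt.Errorf("rule %q: unknown comparison %q", rule.Name, rule.Comparison)
	}
}

func main() {
	rule := EarlyStoppingRule{Name: "accuracy", Value: "0.77", Comparison: Less}
	met, err := ruleMet(rule, 0.70)
	fmt.Println(met, err)
}
```

Keeping `Value` as a string matches the proposed API, so the numeric conversion happens only at evaluation time.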

@gaocegege @johnugeorge In the Experiment API, do we need to define [`EarlyStoppingSpec`](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/common/v1beta1/common_types.go#L27) in `ExperimentSpec`, or should we leave it as it is?

In [manager](https://github.com/kubeflow/katib/blob/master/pkg/apis/manager/v1beta1/api.proto) I propose these API changes:

```proto
service EarlyStopping {
    rpc GetEarlyStoppingRules(GetEarlyStoppingRulesRequest) returns (GetEarlyStoppingRulesReply);
}

message GetEarlyStoppingRulesRequest {
    Experiment experiment = 1;
    repeated Trial trials = 2;
}

message GetEarlyStoppingRulesReply {
    message EarlyStoppingRules {
        repeated EarlyStoppingRule rules = 1;
    }
    repeated EarlyStoppingRules early_stopping_rules = 2;
}

message EarlyStoppingRule {
    string name = 1;
    string value = 2;
    ComparisonType comparison = 3;
}

enum ComparisonType {
    EQUAL = 0;
    LESS = 1;
    GREATER = 2;
}

message GetSuggestionsReply {
    message ParameterAssignments {
        repeated ParameterAssignment assignments = 1;
    }
    repeated ParameterAssignments parameter_assignments = 1;
    AlgorithmSpec algorithm = 2;
    message EarlyStoppingRules {
        repeated EarlyStoppingRule rules = 1;
    }
    repeated EarlyStoppingRules early_stopping_rules = 3;
}
```

2. To be more explicit, I think we should expose Early Stopping as a separate service in the Suggestion deployment. My suggestion is to add another container alongside the [`suggestion` container](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/suggestion/composer/composer.go#L147). To handle two services in one deployment, we can define multiple ports in one Kubernetes service.

Advantages:
- We don't need to manage any resources other than the Suggestion deployment.
- We can use `katib-config` to define the Early Stopping container image.
- Users can see the logs from the Early Stopping algorithm explicitly.

We can have this workflow:
Experiment is submitted -> `GetSuggestions` call -> `GetEarlyStoppingRules` call -> Trials are created.

In that workflow, the algorithm can produce parameters and add them to the algorithm settings before Early Stopping is called.

Also, the `GetSuggestions` call can generate early stopping parameters directly. This can be used in **Hyperband**, where we need to define the resource for an early-stopped Trial, e.g. the number of epochs.

3. `GetEarlyStoppingRules` returns the parameter name, parameter value, and comparison type. These parameters are injected into the metrics collector args. For example: `-stop-rule num-epochs;equal;5 -stop-rule accuracy;less;0.77`
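
As a rough sketch of how the metrics collector could read these args, the snippet below parses a repeatable `-stop-rule` flag with the standard `flag` package. The `name;comparison;value` field order is taken from the example above; the `stopRule` type and the `parseStopRules` function are hypothetical names used only for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stopRule is a hypothetical in-memory form of one -stop-rule argument.
type stopRule struct {
	Name       string
	Comparison string
	Value      string
}

// stopRules implements flag.Value so that -stop-rule can be repeated.
type stopRules []stopRule

func (s *stopRules) String() string {
	if s == nil {
		return ""
	}
	return fmt.Sprint(*s)
}

// Set parses one "name;comparison;value" argument and appends it.
func (s *stopRules) Set(raw string) error {
	parts := strings.Split(raw, ";")
	if len(parts) != 3 {
		return fmt.Errorf("stop rule %q: want name;comparison;value", raw)
	}
	switch parts[1] {
	case "equal", "less", "greater":
	default:
		return fmt.Errorf("stop rule %q: unknown comparison %q", raw, parts[1])
	}
	*s = append(*s, stopRule{Name: parts[0], Comparison: parts[1], Value: parts[2]})
	return nil
}

// parseStopRules collects every -stop-rule flag found in args.
func parseStopRules(args []string) (stopRules, error) {
	var rules stopRules
	fs := flag.NewFlagSet("metrics-collector", flag.ContinueOnError)
	fs.Var(&rules, "stop-rule", "early stopping rule as name;comparison;value (repeatable)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return rules, nil
}

func main() {
	args := []string{"-stop-rule", "num-epochs;equal;5", "-stop-rule", "accuracy;less;0.77"}
	rules, err := parseStopRules(args)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range rules {
		fmt.Println(r.Name, r.Comparison, r.Value)
	}
}
```

Rejecting malformed rules at parse time means a bad rule fails the collector at startup rather than being ignored during training.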

4. We extend the training container command with this condition:
```
if test -f '/var/log/katib/early-stopped'; then
    echo 'Training Container was Early Stopped';
else
    echo 'Training Container failed';
    exit 1;
fi;
```

5. Once the metrics collector finds that all required stop rules are met, it creates the `early-stopped` file and kills the main training process. I assume that the format for parsing stop-rule metrics will be the same as the [metrics collector filter](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/common/v1beta1/common_types.go#L119-L123).
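
The marker-file part of this step could look roughly like the sketch below. All function names are hypothetical, and killing the training process is left out because the issue does not define how the collector finds it. The sketch also assumes that an empty rule list never triggers early stopping.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// allRulesMet reports whether every named rule is marked as met.
// Assumption: with no rules defined, the Trial is never early stopped.
func allRulesMet(ruleNames []string, met map[string]bool) bool {
	for _, name := range ruleNames {
		if !met[name] {
			return false
		}
	}
	return len(ruleNames) > 0
}

// markEarlyStopped creates an empty marker file, e.g.
// /var/log/katib/early-stopped, creating parent directories if needed.
func markEarlyStopped(markerPath string) error {
	if err := os.MkdirAll(filepath.Dir(markerPath), 0o755); err != nil {
		return err
	}
	f, err := os.Create(markerPath)
	if err != nil {
		return err
	}
	return f.Close()
}

// isEarlyStopped mirrors the `test -f` check in the training container command.
func isEarlyStopped(markerPath string) bool {
	info, err := os.Stat(markerPath)
	return err == nil && info.Mode().IsRegular()
}

func main() {
	// A temporary directory stands in for /var/log/katib in this example.
	dir, err := os.MkdirTemp("", "katib")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	marker := filepath.Join(dir, "early-stopped")
	rules := []string{"num-epochs", "accuracy"}
	met := map[string]bool{"num-epochs": true, "accuracy": true}

	if allRulesMet(rules, met) {
		if err := markEarlyStopped(marker); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("early stopped:", isEarlyStopped(marker))
}
```

Writing the marker before the training process is terminated matters here: the wrapper command from step 4 checks the file only after the process has exited.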

6. The Katib controller should add an `EARLY_STOPPED` status to the Trial.
We can create another table or extend `observation_logs` in the Katib DB to indicate that the Trial was early stopped.
Alternatively, we can use `service EarlyStopping` to change the Trial status. In that case, we should add a service account to the Suggestion deployment so that it can modify the Trial resource.
@gaocegege @johnugeorge Any other ideas on how we can report this info to the controller?

Please let me know if I missed something.

/priority p0

/cc @johnugeorge @gaocegege 
/cc @jlewi 

