Early stopping implementation #1330

@andreyvelich

/kind feature

To continue the idea proposed in #692 (comment), we have to implement these steps:

  1. I propose these API changes in Suggestion:
// SuggestionSpec defines the desired state of suggestion.
type SuggestionSpec struct {
	// Name of the algorithm that the suggestion uses.
	AlgorithmName string `json:"algorithmName"`
	. . .
	// Name of the early stopping algorithm
	EarlyStoppingAlgorithmName string `json:"earlyStoppingAlgorithmName,omitempty"`
	. . .
}

// TrialAssignment is the assignment for one trial.
type TrialAssignment struct {
	// Suggestion results
	ParameterAssignments []common.ParameterAssignment `json:"parameterAssignments,omitempty"`

	// Name of the suggestion
	Name string `json:"name,omitempty"`

	// Parameters for early stopping techniques
	// Contains parameter name, value and comparison type
	EarlyStoppingRules []common.EarlyStoppingRule `json:"earlyStoppingRules,omitempty"`
}

I propose these API changes in Trial:

type TrialSpec struct {
	. . .
	// Key-value pairs for hyperparameters and assignment values.
	ParameterAssignments []common.ParameterAssignment `json:"parameterAssignments"`

	// Parameters for early stopping techniques
	// Contains parameter name, value and comparison type
	EarlyStoppingRules []common.EarlyStoppingRule `json:"earlyStoppingRules,omitempty"`
	. . .
}

I propose these API changes in Common:

type EarlyStoppingRule struct {
	Name       string         `json:"name,omitempty"`
	Value      string         `json:"value,omitempty"`
	Comparison ComparisonType `json:"comparison,omitempty"`
}

type ComparisonType string

const (
	Equal   ComparisonType = "equal"
	Less    ComparisonType = "less"
	Greater ComparisonType = "greater"
)
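
To make the intended semantics of a rule concrete, here is a minimal sketch of how a single rule might be evaluated against an observed metric. It assumes that a rule is satisfied when "observed value <comparison> rule value" holds and that values are numeric; the proposal does not spell either out. The helper ruleMatches is hypothetical and not part of the proposed API.

```go
package main

import (
	"fmt"
	"strconv"
)

// ComparisonType and EarlyStoppingRule mirror the types proposed above
// (JSON tags omitted for brevity).
type ComparisonType string

const (
	Equal   ComparisonType = "equal"
	Less    ComparisonType = "less"
	Greater ComparisonType = "greater"
)

type EarlyStoppingRule struct {
	Name       string
	Value      string
	Comparison ComparisonType
}

// ruleMatches is a hypothetical helper. It reports whether the observed
// metric satisfies the rule, assuming the rule reads as
// "observed <comparison> rule.Value" and that rule.Value is numeric.
func ruleMatches(rule EarlyStoppingRule, observed float64) (bool, error) {
	threshold, err := strconv.ParseFloat(rule.Value, 64)
	if err != nil {
		return false, fmt.Errorf("rule %q: invalid value %q: %w", rule.Name, rule.Value, err)
	}
	switch rule.Comparison {
	case Equal:
		return observed == threshold, nil
	case Less:
		return observed < threshold, nil
	case Greater:
		return observed > threshold, nil
	default:
		return false, fmt.Errorf("rule %q: unknown comparison %q", rule.Name, rule.Comparison)
	}
}

func main() {
	rule := EarlyStoppingRule{Name: "accuracy", Value: "0.77", Comparison: Less}
	matched, err := ruleMatches(rule, 0.70)
	fmt.Println(matched, err)
}
```

Keeping Value as a string, as in the proposal, means the consumer decides how to parse it; the sketch parses it as a float64 only for illustration.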

@gaocegege @johnugeorge In the Experiment API, do we need to define EarlyStoppingSpec in ExperimentSpec, or should we leave it as it is?

In the manager, I propose these API changes:

service EarlyStopping {
    rpc GetEarlyStoppingRules(GetEarlyStoppingRulesRequest) returns (GetEarlyStoppingRulesReply);
}

message GetEarlyStoppingRulesRequest {
    Experiment experiment = 1;
    repeated Trial trials = 2; 
}

message GetEarlyStoppingRulesReply {
    message EarlyStoppingRules{
        repeated EarlyStoppingRule rules = 1;
    }
    repeated EarlyStoppingRules early_stopping_rules = 2;
}


message EarlyStoppingRule {
    string name = 1;
    string value = 2;
    ComparisonType comparison = 3;
}

enum ComparisonType {
    EQUAL = 0;
    LESS = 1;
    GREATER = 2;
}

message GetSuggestionsReply {
    message ParameterAssignments{
        repeated ParameterAssignment assignments = 1;
    }
    repeated ParameterAssignments parameter_assignments = 1;
    AlgorithmSpec algorithm = 2;
    message EarlyStoppingRules{
        repeated EarlyStoppingRule rules = 1;
    }
    repeated EarlyStoppingRules early_stopping_rules = 3;
}

  1. To be more explicit, I think we should expose Early Stopping as a separate service in the Suggestion deployment. My suggestion is to run another container alongside the suggestion container. To handle two services in one deployment, we can define multiple ports in one Kubernetes service.

Advantages:

  • We don't need to manage any resources other than the Suggestion deployment.
  • We can use katib-config to define the Early Stopping container image.
  • Users can explicitly see logs from the Early Stopping algorithm.

We can have this workflow:
Experiment is submitted -> GetSuggestions call -> GetEarlyStoppingRules call -> Trials are created.

In that workflow, the Algorithm can produce parameters and add them to the algorithm settings before Early Stopping is called.

Also, the GetSuggestions call can generate early stopping parameters directly. This can be used in Hyperband, where we need to define the resource for an early-stopped Trial, e.g. the number of epochs.
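
As a rough illustration of the last step in this workflow, the sketch below combines the parameter sets returned by GetSuggestions with the rule sets returned by GetEarlyStoppingRules into per-trial assignments. Pairing the two lists by index is an assumption (the proposal does not say how they correspond), and buildTrialAssignments as well as the index-based trial names are hypothetical.

```go
package main

import "fmt"

// Simplified stand-ins for the proposed API types.
type ParameterAssignment struct {
	Name  string
	Value string
}

type EarlyStoppingRule struct {
	Name       string
	Value      string
	Comparison string
}

type TrialAssignment struct {
	Name                 string
	ParameterAssignments []ParameterAssignment
	EarlyStoppingRules   []EarlyStoppingRule
}

// buildTrialAssignments is a hypothetical helper. It assumes that the i-th
// rule set belongs to the i-th parameter set. An empty rules slice means
// early stopping is not used and no rules are attached.
func buildTrialAssignments(prefix string, params [][]ParameterAssignment, rules [][]EarlyStoppingRule) ([]TrialAssignment, error) {
	if len(rules) != 0 && len(rules) != len(params) {
		return nil, fmt.Errorf("got %d rule sets for %d parameter sets", len(rules), len(params))
	}
	out := make([]TrialAssignment, 0, len(params))
	for i, p := range params {
		ta := TrialAssignment{
			Name:                 fmt.Sprintf("%s-%d", prefix, i), // hypothetical naming scheme
			ParameterAssignments: p,
		}
		if len(rules) != 0 {
			ta.EarlyStoppingRules = rules[i]
		}
		out = append(out, ta)
	}
	return out, nil
}

func main() {
	params := [][]ParameterAssignment{{{Name: "lr", Value: "0.01"}}}
	rules := [][]EarlyStoppingRule{{{Name: "accuracy", Value: "0.77", Comparison: "less"}}}
	assignments, err := buildTrialAssignments("example", params, rules)
	fmt.Println(assignments, err)
}
```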

  1. GetEarlyStoppingRules returns the parameter name, parameter value, and comparison type. These parameters are injected into the metrics collector args. For example: -stop-rule num-epochs;equal;5 -stop-rule accuracy;less;0.77

  2. We extend the training container command with this condition:

if test -f '/var/log/katib/early-stopped'; then
    echo 'Training Container was Early Stopped';
else
    echo 'Training Container failed';
    exit 1;
fi;
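
On the metrics collector side, the repeated -stop-rule arguments from the example above could be parsed with Go's standard flag package. This is only a sketch: the name;comparison;value layout is taken from the example, while the stopRule type, the parseStopRules helper, and the validation of comparison names are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stopRule holds one parsed "-stop-rule name;comparison;value" argument.
type stopRule struct {
	Name       string
	Comparison string
	Value      string
}

// stopRules implements flag.Value so that -stop-rule can be repeated.
type stopRules []stopRule

func (s *stopRules) String() string { return fmt.Sprint(*s) }

// Set is called once per -stop-rule occurrence and appends the parsed rule.
func (s *stopRules) Set(arg string) error {
	parts := strings.Split(arg, ";")
	if len(parts) != 3 {
		return fmt.Errorf("stop rule %q: want name;comparison;value", arg)
	}
	switch parts[1] {
	case "equal", "less", "greater":
	default:
		return fmt.Errorf("stop rule %q: unknown comparison %q", arg, parts[1])
	}
	*s = append(*s, stopRule{Name: parts[0], Comparison: parts[1], Value: parts[2]})
	return nil
}

// parseStopRules is a hypothetical helper that extracts all stop rules
// from the metrics collector args.
func parseStopRules(args []string) ([]stopRule, error) {
	var rules stopRules
	fs := flag.NewFlagSet("metrics-collector", flag.ContinueOnError)
	fs.Var(&rules, "stop-rule", "early stopping rule as name;comparison;value (repeatable)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return rules, nil
}

func main() {
	rules, err := parseStopRules([]string{
		"-stop-rule", "num-epochs;equal;5",
		"-stop-rule", "accuracy;less;0.77",
	})
	fmt.Println(rules, err)
}
```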
  1. Once the metrics collector finds all required stop-rules, it creates the early-stopped file and kills the main training process. I assume that the format for parsing stop-rule metrics will be the same as the metrics collector filter.

  2. The Katib controller should add an EARLY_STOPPED status to the Trial.
    We can create another table or extend observation_logs in the Katib-DB to indicate that the Trial was early stopped.
    Alternatively, we can use the EarlyStopping service to change the Trial status. In that case, we should add a service account to the Suggestion deployment so that it can modify the Trial resource.
    @gaocegege @johnugeorge Any other ideas on how we can report this info to the controller?

Please let me know if I missed something.

/priority p0

/cc @johnugeorge @gaocegege
/cc @jlewi
