Description
/kind feature
To continue the idea proposed in #692 (comment), we need to implement the following steps:
- I propose these API changes in Suggestion:
// SuggestionSpec defines the desired state of suggestion.
type SuggestionSpec struct {
// Name of the algorithm that the suggestion uses.
AlgorithmName string `json:"algorithmName"`
. . .
// Name of the early stopping algorithm
EarlyStoppingAlgorithmName string `json:"earlyStoppingAlgorithmName,omitempty"`
. . .
}
// TrialAssignment is the assignment for one trial.
type TrialAssignment struct {
// Suggestion results
ParameterAssignments []common.ParameterAssignment `json:"parameterAssignments,omitempty"`
// Name of the suggestion
Name string `json:"name,omitempty"`
// Parameters for early stopping techniques
// Contains parameter name, value and comparison type
EarlyStoppingRules []common.EarlyStoppingRule `json:"earlyStoppingRules,omitempty"`
}
I propose these API changes in Trial:
type TrialSpec struct {
. . .
// Key-value pairs for hyperparameters and assignment values.
ParameterAssignments []common.ParameterAssignment `json:"parameterAssignments"`
// Parameters for early stopping techniques
// Contains parameter name, value and comparison type
EarlyStoppingRules []common.EarlyStoppingRule `json:"earlyStoppingRules,omitempty"`
. . .
}
These API changes in Common:
type EarlyStoppingRule struct {
Name string `json:"name,omitempty"`
Value string `json:"value,omitempty"`
Comparison ComparisonType `json:"comparison,omitempty"`
}
type ComparisonType string
const (
Equal ComparisonType = "equal"
Less ComparisonType = "less"
Greater ComparisonType = "greater"
)
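To make the intended semantics of the three comparison types concrete, here is a minimal sketch of how a single rule could be checked against an observed metric value. The helper `ruleTriggered` is hypothetical, and it assumes that `Value` always holds a number and that a rule "triggers" when `observed <comparison> Value` is true (for example, `accuracy;less;0.77` triggers when accuracy is below 0.77).

```go
package main

import (
	"fmt"
	"strconv"
)

// ComparisonType and EarlyStoppingRule mirror the proposed Common types.
type ComparisonType string

const (
	Equal   ComparisonType = "equal"
	Less    ComparisonType = "less"
	Greater ComparisonType = "greater"
)

type EarlyStoppingRule struct {
	Name       string
	Value      string
	Comparison ComparisonType
}

// ruleTriggered is a hypothetical helper (not part of the proposal).
// It reports whether the observed metric value satisfies the rule,
// assuming rule.Value parses as a float.
func ruleTriggered(rule EarlyStoppingRule, observed float64) (bool, error) {
	threshold, err := strconv.ParseFloat(rule.Value, 64)
	if err != nil {
		return false, fmt.Errorf("rule %q: invalid value %q: %w", rule.Name, rule.Value, err)
	}
	switch rule.Comparison {
	case Equal:
		// Exact float equality; suitable for integer-like metrics such as epochs.
		return observed == threshold, nil
	case Less:
		return observed < threshold, nil
	case Greater:
		return observed > threshold, nil
	default:
		return false, fmt.Errorf("rule %q: unknown comparison %q", rule.Name, rule.Comparison)
	}
}

func main() {
	rule := EarlyStoppingRule{Name: "accuracy", Value: "0.77", Comparison: Less}
	triggered, err := ruleTriggered(rule, 0.70)
	fmt.Println(triggered, err)
}
```

Keeping `Value` as a string matches the proposed struct; the parse step is where a malformed rule would surface.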
@gaocegege @johnugeorge In the Experiment API, do we need to define EarlyStoppingSpec
in ExperimentSpec, or should we leave it as it is?
In manager I propose these API changes:
service EarlyStopping {
rpc GetEarlyStoppingRules(GetEarlyStoppingRulesRequest) returns (GetEarlyStoppingRulesReply);
}
message GetEarlyStoppingRulesRequest {
Experiment experiment = 1;
repeated Trial trials = 2;
}
message GetEarlyStoppingRulesReply {
message EarlyStoppingRules{
repeated EarlyStoppingRule rules = 1;
}
repeated EarlyStoppingRules early_stopping_rules = 2;
}
message EarlyStoppingRule {
string name = 1;
string value = 2;
ComparisonType comparison = 3;
}
enum ComparisonType {
EQUAL = 0;
LESS = 1;
GREATER = 2;
}
message GetSuggestionsReply {
message ParameterAssignments{
repeated ParameterAssignment assignments = 1;
}
repeated ParameterAssignments parameter_assignments = 1;
AlgorithmSpec algorithm = 2;
message EarlyStoppingRules{
repeated EarlyStoppingRule rules = 1;
}
repeated EarlyStoppingRules early_stopping_rules = 3;
}
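Because the manager API uses an enum for the comparison while the Go API types use strings, some conversion is needed between the two. The sketch below shows one possible mapping. `pbComparisonType` and its constants are local stand-ins for whatever Go code protoc would generate from the enum above, and `fromProto` is a hypothetical helper name.

```go
package main

import "fmt"

// ComparisonType mirrors the proposed string type in the Common API.
type ComparisonType string

const (
	Equal   ComparisonType = "equal"
	Less    ComparisonType = "less"
	Greater ComparisonType = "greater"
)

// pbComparisonType is a stand-in for the generated protobuf enum
// (assumption: the real generated type and constant names will differ).
type pbComparisonType int32

const (
	pbEqual   pbComparisonType = 0
	pbLess    pbComparisonType = 1
	pbGreater pbComparisonType = 2
)

// fromProto converts the manager-side enum to the controller-side string.
func fromProto(c pbComparisonType) (ComparisonType, error) {
	switch c {
	case pbEqual:
		return Equal, nil
	case pbLess:
		return Less, nil
	case pbGreater:
		return Greater, nil
	default:
		return "", fmt.Errorf("unknown comparison enum value %d", c)
	}
}

func main() {
	c, err := fromProto(pbLess)
	fmt.Println(c, err)
}
```

One design point worth noting: if the proto file uses proto3, an unset `comparison` field decodes as the zero value, which here is `EQUAL`, so a rule with a missing comparison would be indistinguishable from an explicit equality rule.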
- To be more explicit, I think we should expose Early Stopping as a separate service in the Suggestion deployment. My suggestion is to have another container in addition to the suggestion container. To handle two services in one deployment, we can define multiple ports in one Kubernetes service.
Advantages:
- We don't need to control any resources other than the Suggestion deployment.
- We can use katib-config to define the Early Stopping container image.
- Users can explicitly see logs from the Early Stopping algorithm.
We can have this workflow:
Experiment is submitted -> GetSuggestions call -> GetEarlyStoppingRules call -> Trials are created.
In that workflow, the algorithm can produce parameters and add them to the algorithm settings before Early Stopping is called.
The GetSuggestions call can also generate early stopping parameters directly. This can be used in Hyperband, where we need to define the resource for an early stopped Trial, e.g. num epochs.
- GetEarlyStoppingRules returns the parameter name, parameter value and comparison type. These parameters are injected into the metrics collector args. For example:
  -stop-rule num-epochs;equal;5 -stop-rule accuracy;less;0.77
- We extend the training container command with this condition:
if test -f '/var/log/katib/early-stopped'; then
echo 'Training Container was Early Stopped';
else
echo 'Training Container failed';
exit 1;
fi;
- Once the metrics collector finds that all required stop rules are satisfied, it creates the early-stopped file and kills the main training process. I assume that the format for parsing stop-rule metrics will be the same as the metrics collector filter.
- The Katib controller should add the EARLY_STOPPED status to the Trial.
We can create another table or extend observation_logs in the Katib DB to indicate that the Trial was early stopped.
Alternatively, we can use service EarlyStopping to change the Trial status. In that case, we should add a service account to the Suggestion deployment so that it can change the Trial resource.
@gaocegege @johnugeorge Any other ideas on how we can report this information to the controller?
Please let me know if I missed something.
/priority p0
/cc @johnugeorge @gaocegege
/cc @jlewi