Update trainer to ensure type consistency for `train_args` and `lora_config` #2181
Conversation
Detailed reason for this change:
Example:
Arguments passed to the training container become:
This leads to the following error:
Pull Request Test Coverage Report for Build 10049294187 (Details)
💛 - Coveralls
I built the image of this trainer on my local computer and tried to test my example for the Katib LLM Hyperparameter Optimization API, which utilizes this trainer, but it kept showing the following two errors:
Error 1:
Error 2:
I guess the errors came from the base image, so I updated the version of the base image in the Dockerfile. I'm wondering if anyone else has met the same problem.
Looks good
    def setup_peft_model(model, lora_config):
        # Set up the PEFT model
        lora_config = LoraConfig(**json.loads(lora_config))
        reference_lora_config = LoraConfig()
Can you help me understand how reference_lora_config will be populated with the correct types with the Katib tune API? I believe the Katib arguments come in with --lora_config, which only populates lora_config.
@nsingl00 The logic is quite similar to the handling of TrainingArguments. When using the Katib tune API, lora_config will be populated with parameters passed through "--lora_config". However, these parameters might not have the correct types (e.g., r might be a string instead of an integer). This code ensures that after Katib populates lora_config, its attributes are converted to the correct types by comparing them with the default values in reference_lora_config. This ensures that lora_config is correctly configured with the appropriate types for further use. Does this clarify your question?
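For reference, here is a minimal sketch of that casting step (an assumed shape, not necessarily the exact code in this PR): each attribute of the parsed lora_config is compared against the correctly typed default on reference_lora_config and, when the types differ, cast to the default's type.

```python
import json

from peft import LoraConfig


def cast_lora_config(raw_json: str) -> LoraConfig:
    """Parse the --lora_config JSON and coerce each field to the type of its default."""
    lora_config = LoraConfig(**json.loads(raw_json))
    reference_lora_config = LoraConfig()  # carries the correctly typed default values

    for field, default in vars(reference_lora_config).items():
        value = getattr(lora_config, field)
        if default is None or value is None or isinstance(value, type(default)):
            continue
        if isinstance(default, bool):
            # bool("False") would be True, so interpret the string explicitly
            setattr(lora_config, field, str(value).lower() == "true")
        else:
            # e.g. r="8" (str) -> 8 (int), lora_dropout="0.1" (str) -> 0.1 (float)
            setattr(lora_config, field, type(default)(value))
    return lora_config
```

With this in place, a value such as '{"r": "8", "lora_alpha": "16"}' coming from Katib ends up with integer fields again before the PEFT model is built.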
makes sense
    args = parse_arguments()
    train_args = TrainingArguments(**json.loads(args.training_parameters))
    reference_train_args = transformers.TrainingArguments(
        output_dir=train_args.output_dir
    )
Why are we only passing train_args.output_dir? What about the other parameters of TrainingArguments?
output_dir is the only parameter in TrainingArguments that has no default value and must be set; the other parameters are optional and have default values, so they are automatically set to their defaults when the reference object is constructed and can then serve as type references when traversed. For detailed information, refer to this link: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments
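As a small illustration (hypothetical values; assumes transformers and its accelerate dependency are installed), constructing the reference object with only output_dir leaves every other field at its correctly typed default, and those defaults are what the trainer compares against when casting the Katib-supplied strings:

```python
import transformers

# output_dir is the only required argument; every other field keeps its default.
reference_train_args = transformers.TrainingArguments(output_dir="results")

print(type(reference_train_args.learning_rate))                # <class 'float'>
print(type(reference_train_args.per_device_train_batch_size))  # <class 'int'>
print(type(reference_train_args.num_train_epochs))             # <class 'float'>
```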
Instead of type-casting the params here in the training operator, shall we take a look at the Katib API and see why Katib is translating everything to strings, and fix the issue at that layer?
I am not sure if that would be possible, since the user might use various parts of the Pod spec to pass the HPs. We are doing substitution for the Trial template here: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/manifest/generator.go#L129-L135. As you can see, we don't do any type check before substitution. @tenzen-y @johnugeorge Do you have any suggestions on the above?
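To illustrate why that substitution can only yield strings, here is a simplified, hypothetical stand-in for the template substitution (not the actual generator.go logic): the placeholder sits inside quoted text, so whatever value the suggestion produces comes out quoted.

```python
# Simplified, hypothetical stand-in for the Trial template substitution:
# plain text replacement into the manifest cannot preserve numeric types.
template = '--training_parameters \'{"learning_rate": "${trialParameters.learning_rate}"}\''
suggested_value = 3.355107835249428e-05

substituted = template.replace("${trialParameters.learning_rate}", str(suggested_value))
print(substituted)
# --training_parameters '{"learning_rate": "3.355107835249428e-05"}' -> still a quoted string
```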
OK, I see. It's hard to do with env variables.
/area gsoc
Looks good, thank you @helenxie-bit!
@deepanker13 @johnugeorge Please check this change.
/lgtm
/assign @deepanker13 @johnugeorge
Hi @helenxie-bit
@deepanker13 Yes, I implemented it exactly the same way: https://github.com/kubeflow/katib/blob/61dc8ca1d9e8bec88c3ebc210c0e9b6b587f563a/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L672. However, there is a difference in how Katib and the Training Operator handle the arguments, due to Katib's hyperparameter substitution. For example, when optimizing learning_rate with:

    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            ...
            learning_rate = katib.search.double(min=1e-05, max=5e-05),
            ...
        ),
        ...
    )

Katib applies hyperparameter substitution and uses:

    ..."learning_rate": "${trialParameters.learning_rate}", ...

The Katib controller then sets the value for each trial according to the suggestion, so the training container ultimately receives:

    --training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'

As you can see, the value of learning_rate arrives as a string rather than a float, which is why the trainer needs to cast it back to the correct type.
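On the trainer side this means json.loads hands back strings for those fields, which is exactly what the added casting fixes. A tiny illustration (the output_dir value here is hypothetical):

```python
import json

received = '{"output_dir": "results", "learning_rate": "3.355107835249428e-05"}'
params = json.loads(received)

print(type(params["learning_rate"]))                      # <class 'str'>, not float
params["learning_rate"] = float(params["learning_rate"])  # the cast the trainer now performs
```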
/lgtm
thanks @helenxie-bit
Thank you for this @helenxie-bit!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
…config` (kubeflow/trainer#2181)
* update-trainer
* fix typo
* reformat with black
Signed-off-by: helenxie-bit <[email protected]>
What this PR does / why we need it:
Add data preprocessing for `train_args` and `lora_config` to ensure each parameter's type is consistent with the reference value. This will be necessary for developing the Katib `tune` API to optimize hyperparameters.

Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #

Checklist: