Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
Add data preprocessing for train_args and lora_config to ensure that each parameter's type is consistent with its reference value. This is necessary for developing the Katib tune API to optimize hyperparameters.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Contributor Author

Detailed reason for this change:
We aim to reuse this trainer for the Katib LLM Hyperparameter Optimization API. Katib's controller substitutes hyperparameters with different values for each trial, and these values default to strings. This type inconsistency causes errors when running the trainer. Therefore, it is necessary to preprocess train_args and lora_config to ensure type consistency.

Example:
When optimizing the learning rate, the user sets the parameter:

learning_rate = katib.search.double(min=1e-05, max=5e-05),

Arguments passed to the training container become:

--training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'

This leads to the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/Users/helen/Documents/05_GSoC/training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py", line 196, in <module>
[rank0]:     train_model(model, transformer_type, train_data, eval_data, tokenizer, train_args)
[rank0]:   File "/Users/helen/Documents/05_GSoC/training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py", line 147, in train_model
[rank0]:     trainer.train()
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1624, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1725, in _inner_training_loop
[rank0]:     self.create_optimizer_and_scheduler(num_training_steps=max_steps)
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 954, in create_optimizer_and_scheduler
[rank0]:     self.create_optimizer()
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1001, in create_optimizer
[rank0]:     self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/torch/optim/adamw.py", line 29, in __init__
[rank0]:     if not 0.0 <= lr:
[rank0]:            ^^^^^^^^^
[rank0]: TypeError: '<=' not supported between instances of 'float' and 'str'
E0722 14:52:04.854000 7957912640 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 54960) of binary: /opt/homebrew/anaconda3/envs/katib-llm-test/bin/python
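
For context, here is a minimal sketch of the kind of preprocessing this change adds (the helper name, the sample JSON payload, and the bool exclusion are illustrative, not the exact code in this PR):

import json
import transformers

def cast_numeric_strings(config, reference):
    # Compare every attribute against the default on the reference instance and
    # cast string values back to the default's numeric type (int or float).
    for name, ref_value in vars(reference).items():
        value = getattr(config, name, None)
        if (
            isinstance(value, str)
            and isinstance(ref_value, (int, float))
            and not isinstance(ref_value, bool)  # bool("False") would be True
        ):
            setattr(config, name, type(ref_value)(value))
    return config

# Hypothetical payload with learning_rate already substituted by Katib as a string.
training_parameters = '{"output_dir": "results", "learning_rate": "3.355107835249428e-05"}'
train_args = transformers.TrainingArguments(**json.loads(training_parameters))
reference_train_args = transformers.TrainingArguments(output_dir=train_args.output_dir)
train_args = cast_numeric_strings(train_args, reference_train_args)  # learning_rate is a float again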

@coveralls

Pull Request Test Coverage Report for Build 10049294187

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on helenxie/update-trainer at 35.406%

Totals Coverage Status
Change from base Build 9999203579: 35.4%
Covered Lines: 4378
Relevant Lines: 12365

💛 - Coveralls

@helenxie-bit
Contributor Author

I built the trainer image on my local computer and tried to test my example for the Katib LLM Hyperparameter Optimization API, which utilizes this trainer, but it kept showing the following two errors:

Error 1:

I0724 18:03:43.553156      83 main.go:143]   Traceback (most recent call last):
I0724 18:03:43.553167      83 main.go:143]   File "/app/hf_llm_training.py", line 169, in <module>
I0724 18:03:43.553219      83 main.go:143]     train_args = TrainingArguments(**json.loads(args.training_parameters))
I0724 18:03:43.553226      83 main.go:143]   File "<string>", line 123, in __init__
I0724 18:03:43.553332      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1528, in __post_init__
I0724 18:03:43.553968      83 main.go:143]     and (self.device.type != "cuda")
I0724 18:03:43.553974      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1995, in device
I0724 18:03:43.554210      83 main.go:143]     return self._setup_devices
I0724 18:03:43.554219      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 56, in __get__
I0724 18:03:43.554223      83 main.go:143]     cached = self.fget(obj)
I0724 18:03:43.554297      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1914, in _setup_devices
I0724 18:03:43.554645      83 main.go:143]     self.distributed_state = PartialState(cpu=True, backend=self.ddp_backend)
I0724 18:03:43.554655      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 275, in __init__
I0724 18:03:43.554693      83 main.go:143]     self.set_device()
I0724 18:03:43.554698      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 786, in set_device
I0724 18:03:43.554764      83 main.go:143]     device_module.set_device(self.device)
I0724 18:03:43.554769      83 main.go:143] AttributeError: module 'torch.cpu' has no attribute 'set_device'. Did you mean: '_device'?
I0724 18:03:48.279502      83 main.go:143] [2024-07-24 18:03:48,275] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 66) of binary: /usr/bin/python

Error 2:

0724 21:07:37.439131      69 main.go:143]    Traceback (most recent call last):
I0724 21:07:37.439150      69 main.go:143]   File "/app/hf_llm_training.py", line 9, in <module>
I0724 21:07:37.439158      69 main.go:143]     from peft import LoraConfig, get_peft_model
I0724 21:07:37.439161      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/__init__.py", line 22, in <module>
I0724 21:07:37.439166      69 main.go:143]     from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
I0724 21:07:37.439169      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 16, in <module>
I0724 21:07:37.439248      69 main.go:143]     from .peft_model import (
I0724 21:07:37.439258      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 22, in <module>
I0724 21:07:37.439262      69 main.go:143]     from accelerate import dispatch_model, infer_auto_device_map
I0724 21:07:37.439263      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/__init__.py", line 16, in <module>
I0724 21:07:37.439268      69 main.go:143]     from .accelerator import Accelerator
I0724 21:07:37.439269      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 36, in <module>
I0724 21:07:37.439308      69 main.go:143]     
I0724 21:07:37.439336      69 main.go:143] from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
I0724 21:07:37.439342      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/checkpointing.py", line 24, in <module>
I0724 21:07:37.439346      69 main.go:143]     from .utils import (
I0724 21:07:37.439348      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/__init__.py", line 190, in <module>
I0724 21:07:37.439407      69 main.go:143]     from .bnb import has_4bit_bnb_layers, load_and_quantize_model
I0724 21:07:37.439412      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/bnb.py", line 29, in <module>
I0724 21:07:37.439432      69 main.go:143]     from ..big_modeling import dispatch_model, init_empty_weights
I0724 21:07:37.439437      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py", line 24, in <module>
I0724 21:07:37.439468      69 main.go:143]     from .hooks import (
I0724 21:07:37.439475      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 30, in <module>
I0724 21:07:37.439497      69 main.go:143]     from .utils.other import recursive_getattr
I0724 21:07:37.439509      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/other.py", line 36, in <module>
I0724 21:07:37.439540      69 main.go:143]     from .transformer_engine import convert_model
I0724 21:07:37.439545      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
I0724 21:07:37.439564      69 main.go:143]     import transformer_engine.pytorch as te
I0724 21:07:37.439568      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
I0724 21:07:37.439572      69 main.go:143]     from .module import LayerNormLinear
I0724 21:07:37.439573      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/__init__.py", line 6, in <module>
I0724 21:07:37.439594      69 main.go:143]     from .layernorm_linear import LayerNormLinear
I0724 21:07:37.439598      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/layernorm_linear.py", line 15, in <module>
I0724 21:07:37.439616      69 main.go:143]     from .. import cpp_extensions as tex
I0724 21:07:37.439620      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpp_extensions/__init__.py", line 6, in <module>
I0724 21:07:37.439639      69 main.go:143]     from transformer_engine_extensions import *
I0724 21:07:37.439640      69 main.go:143] ImportError: libc10_cuda.so: cannot open shared object file: No such file or directory
I0724 21:07:37.786588      69 main.go:143] E0724 21:07:37.786000 281473339512928 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 59) of binary: /usr/bin/python

I suspect the errors came from the base image, so I updated the base image version in the Dockerfile to FROM nvcr.io/nvidia/pytorch:24.06-py3, and it works perfectly now.

I'm wondering if anyone else has run into the same problem.


@shivaylamba shivaylamba left a comment


Looks good

def setup_peft_model(model, lora_config):
    # Set up the PEFT model
    lora_config = LoraConfig(**json.loads(lora_config))
    reference_lora_config = LoraConfig()

Can you help me understand how reference_lora_config will be populated with the correct types when using the Katib tune API? I believe the Katib arguments come in through --lora_config, which only populates lora_config.

Contributor Author


@nsingl00 The logic is quite similar to the processing of TrainingArguments. When using the Katib tune API, lora_config will be populated with the parameters passed through "--lora_config". However, these parameters might not have the correct types (e.g., r might be a string instead of an integer). This code ensures that after Katib populates lora_config, its attributes are converted to the correct types by comparing them with the default values in reference_lora_config, so lora_config ends up with the appropriate types for further use. Does this answer your question?
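
For illustration, the comparison could look roughly like the sketch below (the sample values for r and lora_alpha are hypothetical, not taken from the actual PR):

import json
from peft import LoraConfig

# Katib substitution can turn r=8 into r="8"; the reference instance carries the
# defaults whose types we cast the strings back to.
lora_config = LoraConfig(**json.loads('{"r": "8", "lora_alpha": "16"}'))
reference_lora_config = LoraConfig()
for name, ref_value in vars(reference_lora_config).items():
    value = getattr(lora_config, name, None)
    if isinstance(value, str) and isinstance(ref_value, (int, float)) and not isinstance(ref_value, bool):
        setattr(lora_config, name, type(ref_value)(value))  # "8" -> 8, "16" -> 16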


makes sense

args = parse_arguments()
train_args = TrainingArguments(**json.loads(args.training_parameters))
reference_train_args = transformers.TrainingArguments(
    output_dir=train_args.output_dir

Why are we only passing train_args.output_dir? What about the other parameters for TrainingArguments?

Contributor Author


output_dir is the only parameter in TrainingArguments that has no default value and must be set explicitly; all other parameters are optional and have defaults. Those optional parameters are automatically populated with their default values when the reference object is constructed, and the type comparison traverses them. For detailed information, refer to this link: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments
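
To make that concrete, a small hedged example (the output_dir value here is just a placeholder):

import transformers

# Only the required field is supplied; every optional field is filled with its
# documented default, whose type drives the comparison above.
reference = transformers.TrainingArguments(output_dir="test_output")
print(type(reference.learning_rate))     # <class 'float'> (default 5e-05)
print(type(reference.num_train_epochs))  # <class 'float'> (default 3.0)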

@nsingl00

nsingl00 commented Aug 2, 2024

Instead of type-casting the params here in the training operator, shall we take a look at the Katib API to see why Katib is translating everything to strings, and fix the issue at that layer?
Thoughts @andreyvelich @johnugeorge

@andreyvelich
Member

Instead of type-casting the params here in the training operator, shall we take a look at the Katib API to see why Katib is translating everything to strings, and fix the issue at that layer?

I am not sure that would be possible, since users might use various parts of the Pod spec to pass the HPs.
For example, they can use an environment variable to pass the HPs to the container, and envVar supports only string values: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2315C2-L2315C7

We are doing substitution for the Trial template here: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/manifest/generator.go#L129-L135

As you can see, we don't do any type checking before substitution.

@tenzen-y @johnugeorge Do you have any suggestion on the above ?

@nsingl00

nsingl00 commented Aug 6, 2024

Instead of type-casting the params here in the training operator, shall we take a look at the Katib API to see why Katib is translating everything to strings, and fix the issue at that layer?

I am not sure that would be possible, since users might use various parts of the Pod spec to pass the HPs. For example, they can use an environment variable to pass the HPs to the container, and envVar supports only string values: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2315C2-L2315C7

We are doing substitution for the Trial template here: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/manifest/generator.go#L129-L135

As you can see, we don't do any type checking before substitution.

@tenzen-y @johnugeorge Do you have any suggestion on the above ?

OK, I see. It's hard to do with env variables.

@helenxie-bit
Contributor Author

/area gsoc

Member

@andreyvelich andreyvelich left a comment


Looks good, thank you @helenxie-bit!
@deepanker13 @johnugeorge Please check this change.
/lgtm
/assign @deepanker13 @johnugeorge

@deepanker13
Contributor

Hi @helenxie-bit,
How are you passing values to the training container in Katib?
Is it possible to pass values the same way they are passed in the training operator?
"--training_parameters", json.dumps(train_parameters.training_parameters.to_dict()),

@helenxie-bit
Contributor Author

Hi @helenxie-bit How are you passing values to the training container in Katib? Is it possible to pass values the same way they are passed in the training operator? "--training_parameters", json.dumps(train_parameters.training_parameters.to_dict()),

@deepanker13 Yes, I implemented it exactly the same way: https://github.com/kubeflow/katib/blob/61dc8ca1d9e8bec88c3ebc210c0e9b6b587f563a/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L672. However, there is a difference between how Katib and the Training Operator handle the arguments due to Katib's hyperparameter substitution.

For example, when optimizing learning_rate, the user would set the parameters like this:

trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            ...
            learning_rate = katib.search.double(min=1e-05, max=5e-05),
            ...
        ),
       ...
    )

Katib applies hyperparameter substitution and uses json.dumps(train_parameters.training_parameters.to_dict()), resulting in:

..."learning_rate": "${trialParameters.learning_rate}", ...

The Katib controller then sets the value for each trial according to the suggestion, so the training container ultimately receives:

--training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'

As you can see, the value of learning_rate is a string instead of a float, which is why we need to add data preprocessing inside the trainer.
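
In other words (a tiny self-contained illustration using the sample value above):

import json

payload = '{"learning_rate": "3.355107835249428e-05"}'
decoded = json.loads(payload)
print(type(decoded["learning_rate"]))  # <class 'str'> -- needs casting to float before training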

@deepanker13
Contributor

/lgtm

@deepanker13
Contributor

thanks @helenxie-bit

Member

@andreyvelich andreyvelich left a comment


Thank you for this @helenxie-bit!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 2561b52 into kubeflow:master Aug 12, 2024
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
…config` (kubeflow/trainer#2181)

* update-trainer

Signed-off-by: helenxie-bit <[email protected]>

* fix typo

Signed-off-by: helenxie-bit <[email protected]>

* reformat with black

Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>