Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
Add data preprocessing for train_args and lora_config to ensure that each parameter's type is consistent with its reference value. This is necessary for developing the Katib tune API to optimize hyperparameters.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Contributor Author

Detailed reason for this change:
We aim to reuse this trainer for the Katib LLM Hyperparameter Optimization API. Katib's controller substitutes hyperparameters with different values for each trial, and these values default to strings. This type inconsistency causes errors when running the trainer. Therefore, it is necessary to preprocess train_args and lora_config to ensure type consistency.

Example:
When optimizing the learning rate, the user sets the parameter:

learning_rate = katib.search.double(min=1e-05, max=5e-05),

Arguments passed to the training container become:

--training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'

This leads to the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/Users/helen/Documents/05_GSoC/training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py", line 196, in <module>
[rank0]:     train_model(model, transformer_type, train_data, eval_data, tokenizer, train_args)
[rank0]:   File "/Users/helen/Documents/05_GSoC/training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py", line 147, in train_model
[rank0]:     trainer.train()
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1624, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1725, in _inner_training_loop
[rank0]:     self.create_optimizer_and_scheduler(num_training_steps=max_steps)
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 954, in create_optimizer_and_scheduler
[rank0]:     self.create_optimizer()
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1001, in create_optimizer
[rank0]:     self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/torch/optim/adamw.py", line 29, in __init__
[rank0]:     if not 0.0 <= lr:
[rank0]:            ^^^^^^^^^
[rank0]: TypeError: '<=' not supported between instances of 'float' and 'str'
E0722 14:52:04.854000 7957912640 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 54960) of binary: /opt/homebrew/anaconda3/envs/katib-llm-test/bin/python
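
For context, here is a minimal sketch of the kind of preprocessing this change adds (the helper name, the sample JSON payload, and the bool exclusion are illustrative, not the exact code in this PR):

import json
import transformers

def cast_numeric_strings(config, reference):
    # Compare every attribute against the default on the reference instance and
    # cast string values back to the default's numeric type (int or float).
    for name, ref_value in vars(reference).items():
        value = getattr(config, name, None)
        if (
            isinstance(value, str)
            and isinstance(ref_value, (int, float))
            and not isinstance(ref_value, bool)  # bool("False") would be True
        ):
            setattr(config, name, type(ref_value)(value))
    return config

# Hypothetical payload with learning_rate already substituted by Katib as a string.
training_parameters = '{"output_dir": "results", "learning_rate": "3.355107835249428e-05"}'
train_args = transformers.TrainingArguments(**json.loads(training_parameters))
reference_train_args = transformers.TrainingArguments(output_dir=train_args.output_dir)
train_args = cast_numeric_strings(train_args, reference_train_args)  # learning_rate is a float again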

@coveralls

Pull Request Test Coverage Report for Build 10049294187

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on helenxie/update-trainer at 35.406%

Totals Coverage Status
Change from base Build 9999203579: 35.4%
Covered Lines: 4378
Relevant Lines: 12365

💛 - Coveralls

@helenxie-bit
Contributor Author

I built the trainer image on my local computer and tried to test my example for the Katib LLM Hyperparameter Optimization API, which utilizes this trainer, but it kept showing the following two errors:

Error 1:

I0724 18:03:43.553156      83 main.go:143]   Traceback (most recent call last):
I0724 18:03:43.553167      83 main.go:143]   File "/app/hf_llm_training.py", line 169, in <module>
I0724 18:03:43.553219      83 main.go:143]     train_args = TrainingArguments(**json.loads(args.training_parameters))
I0724 18:03:43.553226      83 main.go:143]   File "<string>", line 123, in __init__
I0724 18:03:43.553332      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1528, in __post_init__
I0724 18:03:43.553968      83 main.go:143]     and (self.device.type != "cuda")
I0724 18:03:43.553974      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1995, in device
I0724 18:03:43.554210      83 main.go:143]     return self._setup_devices
I0724 18:03:43.554219      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 56, in __get__
I0724 18:03:43.554223      83 main.go:143]     cached = self.fget(obj)
I0724 18:03:43.554297      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1914, in _setup_devices
I0724 18:03:43.554645      83 main.go:143]     self.distributed_state = PartialState(cpu=True, backend=self.ddp_backend)
I0724 18:03:43.554655      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 275, in __init__
I0724 18:03:43.554693      83 main.go:143]     self.set_device()
I0724 18:03:43.554698      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 786, in set_device
I0724 18:03:43.554764      83 main.go:143]     device_module.set_device(self.device)
I0724 18:03:43.554769      83 main.go:143] AttributeError: module 'torch.cpu' has no attribute 'set_device'. Did you mean: '_device'?
I0724 18:03:48.279502      83 main.go:143] [2024-07-24 18:03:48,275] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 66) of binary: /usr/bin/python

Error 2:

0724 21:07:37.439131      69 main.go:143]    Traceback (most recent call last):
I0724 21:07:37.439150      69 main.go:143]   File "/app/hf_llm_training.py", line 9, in <module>
I0724 21:07:37.439158      69 main.go:143]     from peft import LoraConfig, get_peft_model
I0724 21:07:37.439161      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/__init__.py", line 22, in <module>
I0724 21:07:37.439166      69 main.go:143]     from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
I0724 21:07:37.439169      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 16, in <module>
I0724 21:07:37.439248      69 main.go:143]     from .peft_model import (
I0724 21:07:37.439258      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 22, in <module>
I0724 21:07:37.439262      69 main.go:143]     from accelerate import dispatch_model, infer_auto_device_map
I0724 21:07:37.439263      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/__init__.py", line 16, in <module>
I0724 21:07:37.439268      69 main.go:143]     from .accelerator import Accelerator
I0724 21:07:37.439269      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 36, in <module>
I0724 21:07:37.439308      69 main.go:143]     
I0724 21:07:37.439336      69 main.go:143] from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
I0724 21:07:37.439342      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/checkpointing.py", line 24, in <module>
I0724 21:07:37.439346      69 main.go:143]     from .utils import (
I0724 21:07:37.439348      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/__init__.py", line 190, in <module>
I0724 21:07:37.439407      69 main.go:143]     from .bnb import has_4bit_bnb_layers, load_and_quantize_model
I0724 21:07:37.439412      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/bnb.py", line 29, in <module>
I0724 21:07:37.439432      69 main.go:143]     from ..big_modeling import dispatch_model, init_empty_weights
I0724 21:07:37.439437      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py", line 24, in <module>
I0724 21:07:37.439468      69 main.go:143]     from .hooks import (
I0724 21:07:37.439475      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 30, in <module>
I0724 21:07:37.439497      69 main.go:143]     from .utils.other import recursive_getattr
I0724 21:07:37.439509      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/other.py", line 36, in <module>
I0724 21:07:37.439540      69 main.go:143]     from .transformer_engine import convert_model
I0724 21:07:37.439545      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
I0724 21:07:37.439564      69 main.go:143]     import transformer_engine.pytorch as te
I0724 21:07:37.439568      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
I0724 21:07:37.439572      69 main.go:143]     from .module import LayerNormLinear
I0724 21:07:37.439573      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/__init__.py", line 6, in <module>
I0724 21:07:37.439594      69 main.go:143]     from .layernorm_linear import LayerNormLinear
I0724 21:07:37.439598      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/layernorm_linear.py", line 15, in <module>
I0724 21:07:37.439616      69 main.go:143]     from .. import cpp_extensions as tex
I0724 21:07:37.439620      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpp_extensions/__init__.py", line 6, in <module>
I0724 21:07:37.439639      69 main.go:143]     from transformer_engine_extensions import *
I0724 21:07:37.439640      69 main.go:143] ImportError: libc10_cuda.so: cannot open shared object file: No such file or directory
I0724 21:07:37.786588      69 main.go:143] E0724 21:07:37.786000 281473339512928 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 59) of binary: /usr/bin/python

I suspect the errors came from the base image, so I updated the base image version in the Dockerfile to FROM nvcr.io/nvidia/pytorch:24.06-py3, and it works perfectly now.

I'm wondering if anyone else has run into the same problem.


@shivaylamba shivaylamba left a comment


Looks good

def setup_peft_model(model, lora_config):
    # Set up the PEFT model
    lora_config = LoraConfig(**json.loads(lora_config))
    reference_lora_config = LoraConfig()

Can you help me understand how reference_lora_config will be populated with the correct types when using the Katib tune API? I believe the Katib arguments come in through --lora_config, which only populates lora_config.

Contributor Author


@nsingl00 The logic is quite similar to the processing of TrainingArguments. When using the Katib tune API, lora_config will be populated with the parameters passed through "--lora_config". However, these parameters might not have the correct types (e.g., r might be a string instead of an integer). This code ensures that after Katib populates lora_config, its attributes are converted to the correct types by comparing them with the default values in reference_lora_config, so lora_config ends up with the appropriate types for further use. Does this answer your question?
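
For illustration, the comparison could look roughly like the sketch below (the sample values for r and lora_alpha are hypothetical, not taken from the actual PR):

import json
from peft import LoraConfig

# Katib substitution can turn r=8 into r="8"; the reference instance carries the
# defaults whose types we cast the strings back to.
lora_config = LoraConfig(**json.loads('{"r": "8", "lora_alpha": "16"}'))
reference_lora_config = LoraConfig()
for name, ref_value in vars(reference_lora_config).items():
    value = getattr(lora_config, name, None)
    if isinstance(value, str) and isinstance(ref_value, (int, float)) and not isinstance(ref_value, bool):
        setattr(lora_config, name, type(ref_value)(value))  # "8" -> 8, "16" -> 16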


makes sense

args = parse_arguments()
train_args = TrainingArguments(**json.loads(args.training_parameters))
reference_train_args = transformers.TrainingArguments(
    output_dir=train_args.output_dir

Why are we only passing train_args.output_dir? What about the other parameters for TrainingArguments?

Contributor Author


output_dir is the only parameter in TrainingArguments that has no default value and must be set explicitly; all other parameters are optional and have defaults. Those optional parameters are automatically populated with their default values when the reference object is constructed, and the type comparison traverses them. For detailed information, refer to this link: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments
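
To make that concrete, a small hedged example (the output_dir value here is just a placeholder):

import transformers

# Only the required field is supplied; every optional field is filled with its
# documented default, whose type drives the comparison above.
reference = transformers.TrainingArguments(output_dir="test_output")
print(type(reference.learning_rate))     # <class 'float'> (default 5e-05)
print(type(reference.num_train_epochs))  # <class 'float'> (default 3.0)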

@nsingl00

nsingl00 commented Aug 2, 2024

Instead of type-casting the params here in the training operator, shall we take a look at the Katib API to see why Katib is translating everything to strings, and fix the issue at that layer?
Thoughts @andreyvelich @johnugeorge

@andreyvelich
Member

Instead of type-casting the params here in the training operator, shall we take a look at the Katib API to see why Katib is translating everything to strings, and fix the issue at that layer?

I am not sure that would be possible, since users might use various parts of the Pod spec to pass the HPs.
For example, they can use an environment variable to pass the HPs to the container, and envVar supports only string values: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2315C2-L2315C7

We are doing substitution for the Trial template here: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/manifest/generator.go#L129-L135

As you can see, we don't do any type checking before substitution.

@tenzen-y @johnugeorge Do you have any suggestion on the above ?

@nsingl00

nsingl00 commented Aug 6, 2024

Instead of type-casting the params here in the training operator, shall we take a look at the Katib API to see why Katib is translating everything to strings, and fix the issue at that layer?

I am not sure that would be possible, since users might use various parts of the Pod spec to pass the HPs. For example, they can use an environment variable to pass the HPs to the container, and envVar supports only string values: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2315C2-L2315C7

We are doing substitution for the Trial template here: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/manifest/generator.go#L129-L135

As you can see, we don't do any type checking before substitution.

@tenzen-y @johnugeorge Do you have any suggestion on the above ?

OK, I see. It's hard to do with env variables.

@helenxie-bit
Contributor Author

/area gsoc

Member

@andreyvelich andreyvelich left a comment


Looks good, thank you @helenxie-bit!
@deepanker13 @johnugeorge Please check this change.
/lgtm
/assign @deepanker13 @johnugeorge

@deepanker13
Contributor

Hi @helenxie-bit,
How are you passing values to the training container in Katib?
Is it possible to pass values the same way they are passed in the training operator?
"--training_parameters", json.dumps(train_parameters.training_parameters.to_dict()),

@helenxie-bit
Contributor Author

Hi @helenxie-bit How are you passing values to the training container in Katib? Is it possible to pass values the same way they are passed in the training operator? "--training_parameters", json.dumps(train_parameters.training_parameters.to_dict()),

@deepanker13 Yes, I implemented it exactly the same way: https://github.com/kubeflow/katib/blob/61dc8ca1d9e8bec88c3ebc210c0e9b6b587f563a/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L672. However, there is a difference between how Katib and the Training Operator handle the arguments due to Katib's hyperparameter substitution.

For example, when optimizing learning_rate, the user would set the parameters like this:

trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            ...
            learning_rate = katib.search.double(min=1e-05, max=5e-05),
            ...
        ),
       ...
    )

Katib applies hyperparameter substitution and uses json.dumps(train_parameters.training_parameters.to_dict()), resulting in:

..."learning_rate": "${trialParameters.learning_rate}", ...

The Katib controller then sets the value for each trial according to the suggestion, so the training container ultimately receives:

--training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'

As you can see, the value of learning_rate is a string instead of a float, which is why we need to add data preprocessing inside the trainer.
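
In other words (a tiny self-contained illustration using the sample value above):

import json

payload = '{"learning_rate": "3.355107835249428e-05"}'
decoded = json.loads(payload)
print(type(decoded["learning_rate"]))  # <class 'str'> -- needs casting to float before training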

@deepanker13
Contributor

/lgtm

@deepanker13
Contributor

thanks @helenxie-bit

Member

@andreyvelich andreyvelich left a comment


Thank you for this @helenxie-bit!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 2561b52 into kubeflow:master Aug 12, 2024
szaher pushed a commit to szaher/sdk that referenced this pull request Jun 4, 2025
…config` (kubeflow/trainer#2181)

* update-trainer

Signed-off-by: helenxie-bit <[email protected]>

* fix typo

Signed-off-by: helenxie-bit <[email protected]>

* reformat with black

Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>