[SDK] Train API #1962

deepanker13 · 2023-12-11T06:13:08Z

What this PR does / why we need it:

This PR adds the train API function, which users call to run a training job.
Constants have been added so that the same values can be reused in multiple places.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Partially Fixes #1945

Checklist:

Docs included if any changes are user facing

coveralls · 2023-12-11T06:16:47Z

Pull Request Test Coverage Report for Build 7477774579

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.01%) to 42.896%

Totals
Change from base Build 7477136161:	0.01%
Covered Lines:	3756
Relevant Lines:	8756

💛 - Coveralls

deepanker13 · 2023-12-11T10:46:55Z

/hold depends on #1959 and #1958

deepanker13 · 2023-12-14T10:43:24Z

/hold cancel

andreyvelich · 2023-12-14T12:48:05Z

/assign @andreyvelich
@deepanker13 Please can you rebase this PR ?

andreyvelich

Thank you for this @deepanker13 !

sdk/python/kubeflow/training/constants/constants.py

sdk/python/kubeflow/training/api/training_client.py

deepanker13 · 2023-12-15T04:53:30Z

> /assign @andreyvelich
> @deepanker13 Please can you rebase this PR ?

Done.

sdk/python/kubeflow/training/api/training_client.py

sdk/python/kubeflow/training/constants/constants.py

sdk/python/kubeflow/storage_init_container/hugging_face.py

sdk/python/kubeflow/storage_init_container/storage.py

deepanker13 · 2024-01-10T10:19:34Z

@andreyvelich I have a reason to keep the download dir field: it gives us a single place to define the default value, and that same value is then passed through the code flow. Otherwise, we would have to verify that the values match throughout the entire code flow.

andreyvelich · 2024-01-10T11:17:33Z

sdk/python/kubeflow/storage_init_container/hugging_face.py

@@ -1,40 +1,51 @@
-from abstract_model_provider import modelProvider


Let's name the directory storage_initializer rather than storage_init_container, to be consistent with the init container name?
WDYT @deepanker13?

andreyvelich · 2024-01-10T11:23:33Z

> @andreyvelich I have a reason to keep the download dir field, as there will be a single place where we define the default value and that same value will be passed through the code flow, else we will have to verify the values are same or not through the entire code flow.

@deepanker13 Can you just have 3 constant variables in storage_initializer/constants.py:

INIT_CONTAINER_MOUNT_PATH = "/workspace"
VOLUME_PATH_DATASET = INIT_CONTAINER_MOUNT_PATH + "/dataset"
VOLUME_PATH_MODEL = INIT_CONTAINER_MOUNT_PATH + "/model"

And then just reuse these constants in both the SDK and storage_initializer.
Does that work for you @deepanker13?
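
A minimal sketch of this suggestion, assuming the three constants are defined once and imported by both sides. The constants are inlined here so the snippet runs standalone; the two helper functions (`build_volume_mount`, `model_download_dir`) are hypothetical names used only for illustration and are not taken from the PR.

```python
# Assumption: in the PR these would live in storage_initializer/constants.py.
# They are inlined here so the sketch is self-contained.
INIT_CONTAINER_MOUNT_PATH = "/workspace"
VOLUME_PATH_DATASET = INIT_CONTAINER_MOUNT_PATH + "/dataset"
VOLUME_PATH_MODEL = INIT_CONTAINER_MOUNT_PATH + "/model"


def build_volume_mount(name: str) -> dict:
    """Hypothetical SDK-side helper: describe where the shared volume is mounted."""
    return {"name": name, "mountPath": INIT_CONTAINER_MOUNT_PATH}


def model_download_dir() -> str:
    """Hypothetical initializer-side helper: directory a model would be saved to."""
    return VOLUME_PATH_MODEL


if __name__ == "__main__":
    # Both sides derive their paths from the same constant, so they cannot drift.
    print(build_volume_mount("storage-volume"))
    print(model_download_dir())
```

Because every path is derived from `INIT_CONTAINER_MOUNT_PATH`, changing the mount point in one place updates both the SDK and the initializer, which addresses the concern about verifying that values match across the code flow.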

andreyvelich

I think we are ready to merge this PR.
Thanks again @deepanker13 for all of this work!
/lgtm
/assign @johnugeorge @tenzen-y for the final review.

deepanker13 · 2024-01-10T16:17:38Z

> I think, we are ready to merge this PR.
> Thanks again @deepanker13 for all of this work!
> /lgtm
> /assign @johnugeorge @tenzen-y for the final review.

@andreyvelich @johnugeorge @tenzen-y thanks for all the help

johnugeorge · 2024-01-10T17:04:58Z

Awesome work @deepanker13
/lgtm
/approve

google-oss-prow · 2024-01-10T17:05:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepanker13, johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* adding constant for init container name
* github workflow fixes
* removing constants file changes from this pr
* code review changes
* initial skeleton of train api
* train api updated
* fixes
* code review changes
* code review changes
* code review changes
* code review changes
* fixing python library requirements
* adding hugging face dataset download class
* code review changes
* fixing github workflow
* code review comments
* import fixes
* integration test fix for python3.7
* torch version fix for python3.7
* removing unused variable
* fixing library versions for python3.7
* removing alpine distribution
* removing torch ad dependency
* removing literal usage as python 3.7 doesn't support it
* adding types.py
* ci fix
* storage init container changes, fixing imports
* adding extra requires in setup.py, fixinf ci
* adding commit to retrigger go test
* renaming folder to storage initalizer
* bug fix
* removing extra gpu check as discussed with johnu
* retriggering ci

google-oss-prow bot added the size/L label Dec 11, 2023

google-oss-prow bot requested review from jinchihe and kuizhiqing December 11, 2023 06:13

google-oss-prow bot added the do-not-merge/hold label Dec 11, 2023

deepanker13 force-pushed the train_api branch 2 times, most recently from ff86df6 to 8cbe61e Compare December 13, 2023 11:28

google-oss-prow bot removed the do-not-merge/hold label Dec 14, 2023

andreyvelich changed the title ~~Train api~~ [SDK] Train API Dec 14, 2023

google-oss-prow bot assigned andreyvelich Dec 14, 2023

andreyvelich reviewed Dec 14, 2023

deepanker13 force-pushed the train_api branch from 8cbe61e to c59b0b0 Compare December 15, 2023 04:52

andreyvelich reviewed Dec 15, 2023
