[train] train v1 export api #51177
Signed-off-by: Matthew Deng <[email protected]>
 )

-TRAIN_SCHEMA_VERSION = 0
+TRAIN_SCHEMA_VERSION = 1
How do we decide when to update this value? Maybe add a comment above it explaining when to increment it, i.e., when files x, y, or z are updated. Or is it just a differentiator between Train V1 and V2?
Nice! LGTM in general.
Just to double check whether any logs from the worker `.out` files might be missing, e.g., the UDF `print()` output.
core_context = ray.runtime_context.get_runtime_context()
controller_log_file_path = (
    ray._private.worker.global_worker.get_err_file_path()
I think by default the Python loggers write to the process `.err` file. But it would be better to double check whether we lose anything that is unique to the `.out` file.
I think the UDF `print()` calls may write to the `.out` file of the worker process.
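The stream split raised in this comment can be reproduced with the standard library alone. The sketch below is not Ray code: it only shows that a default `logging.StreamHandler` targets `sys.stderr` while `print()` targets `sys.stdout`, and the mapping of those streams to a worker's `.err` and `.out` files is taken from the discussion here rather than verified. The logger name `udf_demo` and the function `capture_streams` are made up for illustration.

```python
import contextlib
import io
import logging


def capture_streams():
    """Return (stdout_text, stderr_text) produced by one print() and one log call."""
    out_buf, err_buf = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out_buf), contextlib.redirect_stderr(err_buf):
        logger = logging.getLogger("udf_demo")  # hypothetical logger name
        logger.setLevel(logging.INFO)
        logger.propagate = False  # keep root-logger handlers out of the demo
        # With no argument, StreamHandler binds to the current sys.stderr.
        handler = logging.StreamHandler()
        logger.addHandler(handler)
        try:
            print("from print")          # goes to stdout
            logger.info("from logger")   # goes to stderr
        finally:
            logger.removeHandler(handler)
    return out_buf.getvalue(), err_buf.getvalue()


if __name__ == "__main__":
    stdout_text, stderr_text = capture_streams()
    print("stdout captured:", repr(stdout_text))
    print("stderr captured:", repr(stderr_text))
```

If worker stdout and stderr are written to separate files, a log path that points only at the stderr file would not cover the `print()` line.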
Yep this is just best effort.
looks good!
This PR implements the export API for Ray Train V1 state. This builds on top of #50622, which implements the export API for Ray Train V2.

## Key Changes

- Added `export.py` with conversion functions between Train V1 state models and Train (V2) state export protobuf
- Updated `TrainRunInfo` and `TrainWorkerInfo` schemas with additional fields for compatibility:
  - Log file paths for controller and workers
    - Note that these point to the Ray worker stderr logs, rather than specific train logs.
  - Resource allocation information
  - Made worker status a required field
    - Note that it is always set as ACTIVE for now.

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Dhakshin Suriakannu <[email protected]>
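As a rough illustration of the kind of conversion described above, here is a minimal sketch using plain dataclasses and a dictionary in place of the real state models and protobuf messages. Everything in it is hypothetical: `WorkerInfoSketch`, `RunInfoSketch`, `to_export_dict`, and all field names are stand-ins invented for this example and do not reflect the actual contents of `export.py` or the real `TrainRunInfo` / `TrainWorkerInfo` schemas.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class WorkerStatusSketch(Enum):
    # Hypothetical enum; the description says status is always ACTIVE for now.
    ACTIVE = "ACTIVE"


@dataclass
class WorkerInfoSketch:
    # Hypothetical stand-in for a per-worker state model.
    world_rank: int
    log_file_path: str            # assumed to point at the worker's stderr log
    resources: Dict[str, float]   # resource allocation, e.g. {"CPU": 1.0}
    status: WorkerStatusSketch    # required: no default value


@dataclass
class RunInfoSketch:
    # Hypothetical stand-in for a run-level state model.
    name: str
    controller_log_file_path: str
    workers: List[WorkerInfoSketch]


def to_export_dict(run: RunInfoSketch) -> dict:
    """Convert the stand-in run model into a plain dict (standing in for a protobuf)."""
    return {
        "name": run.name,
        "controller_log_file_path": run.controller_log_file_path,
        "workers": [
            {
                "world_rank": w.world_rank,
                "log_file_path": w.log_file_path,
                "resources": dict(w.resources),
                "status": w.status.value,
            }
            for w in run.workers
        ],
    }


if __name__ == "__main__":
    run = RunInfoSketch(
        name="example_run",
        controller_log_file_path="/tmp/example/controller.err",
        workers=[
            WorkerInfoSketch(0, "/tmp/example/worker-0.err", {"CPU": 1.0},
                             WorkerStatusSketch.ACTIVE),
        ],
    )
    print(to_export_dict(run))
```

Making `status` a field without a default mirrors the "required field" note: the caller has to supply a value, even if that value is always ACTIVE.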