Skip to content

Conversation

matthewdeng
Copy link
Contributor

@matthewdeng matthewdeng commented Mar 8, 2025

This PR implements the export API for Ray Train V1 state. This builds on top of #50622, which implements the export API for Ray Train V2.

Key Changes

  • Added export.py with conversion functions between Train V1 state models and Train (V2) state export protobuf
  • Updated TrainRunInfo and TrainWorkerInfo schemas with additional fields for compatibility:
    • Log file paths for controller and workers
      • Note that these point to the Ray worker stderr logs, rather than specific train logs.
    • Resource allocation information
    • Made worker status a required field
      • Note that it is always set as ACTIVE for now.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
)

TRAIN_SCHEMA_VERSION = 0
TRAIN_SCHEMA_VERSION = 1
Copy link
Contributor

@hongpeng-guo hongpeng-guo Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we decide when to update this value? maybe add a comment above, illustrating when to increment this value, i.e., upon x, y, z files being updated. Or is it just a differentiator between train v1 and v2?

Signed-off-by: Matthew Deng <[email protected]>
Copy link
Contributor

@hongpeng-guo hongpeng-guo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM in general.

Just to double check if there might be any log missing from the worker .out files, i.e., the UDF print()


core_context = ray.runtime_context.get_runtime_context()
controller_log_file_path = (
ray._private.worker.global_worker.get_err_file_path()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think by default the python loggers write to process .err file. But it would be better to double check if we may lose anything that are unique to the .out file.

I think the udf print func may write to the .out file of the worker process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep this is just best effort.

@matthewdeng matthewdeng added the go add ONLY when ready to merge, run all tests label Mar 13, 2025
Copy link
Contributor

@nikitavemuri nikitavemuri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@matthewdeng matthewdeng merged commit 31878c9 into ray-project:master Mar 14, 2025
6 checks passed
@matthewdeng matthewdeng deleted the v2/backport branch March 14, 2025 00:32
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
This PR implements the export API for Ray Train V1 state. This builds on
top of ray-project#50622, which implements
the export API for Ray Train V2.


## Key Changes

- Added `export.py` with conversion functions between Train V1 state
models and Train (V2) state export protobuf
- Updated `TrainRunInfo` and `TrainWorkerInfo` schemas with additional
fields for compatibility:
  - Log file paths for controller and workers
- Note that these point to the Ray worker stderr logs, rather than
specific train logs.
  - Resource allocation information
  - Made worker status a required field
     - Note that it is always set as ACTIVE for now.


Signed-off-by: Matthew Deng <[email protected]>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
This PR implements the export API for Ray Train V1 state. This builds on
top of ray-project#50622, which implements
the export API for Ray Train V2.

## Key Changes

- Added `export.py` with conversion functions between Train V1 state
models and Train (V2) state export protobuf
- Updated `TrainRunInfo` and `TrainWorkerInfo` schemas with additional
fields for compatibility:
  - Log file paths for controller and workers
- Note that these point to the Ray worker stderr logs, rather than
specific train logs.
  - Resource allocation information
  - Made worker status a required field
     - Note that it is always set as ACTIVE for now.

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Dhakshin Suriakannu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-backlog go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants