[train] train v1 export api #51177

matthewdeng · 2025-03-08T03:00:51Z

This PR implements the export API for Ray Train V1 state. This builds on top of #50622, which implements the export API for Ray Train V2.

Key Changes

- Added `export.py` with conversion functions between Train V1 state models and Train (V2) state export protobuf
- Updated `TrainRunInfo` and `TrainWorkerInfo` schemas with additional fields for compatibility:
  - Log file paths for controller and workers
    - Note that these point to the Ray worker stderr logs, rather than specific train logs.
  - Resource allocation information
  - Made worker status a required field
    - Note that it is always set as ACTIVE for now.
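The shape of such a conversion can be sketched as follows. This is a minimal illustration, not the contents of `export.py`: `WorkerInfo`, `RunInfo`, `WorkerStatus`, and `to_export_record` are hypothetical stand-ins, a plain dict stands in for the export protobuf, and the file paths are made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class WorkerStatus(Enum):
    # Hypothetical enum; the PR notes the status is always ACTIVE for now.
    ACTIVE = "ACTIVE"


@dataclass
class WorkerInfo:
    """Hypothetical stand-in for a V1 worker state model."""

    world_rank: int
    log_file_path: str  # points at the worker's stderr log
    resources: Dict[str, float]
    status: WorkerStatus = WorkerStatus.ACTIVE


@dataclass
class RunInfo:
    """Hypothetical stand-in for a V1 run state model."""

    name: str
    controller_log_file_path: str
    workers: List[WorkerInfo] = field(default_factory=list)


def to_export_record(run: RunInfo) -> dict:
    """Flatten the state model into a plain dict (stand-in for the protobuf)."""
    return {
        "name": run.name,
        "controller_log_file_path": run.controller_log_file_path,
        "workers": [
            {
                "world_rank": w.world_rank,
                "log_file_path": w.log_file_path,
                "resources": dict(w.resources),
                "status": w.status.name,
            }
            for w in run.workers
        ],
    }


# Illustrative values only.
run = RunInfo(
    name="example_run",
    controller_log_file_path="/tmp/controller.err",
    workers=[WorkerInfo(0, "/tmp/worker-0.err", {"CPU": 1.0})],
)
record = to_export_record(run)
```

Keeping the conversion in one function means the V1 models only need to carry the extra fields (log paths, resources, status) that the export format expects.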

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Matthew Deng <[email protected]>

python/ray/train/_internal/backend_executor.py

python/ray/train/_internal/state/state_manager.py

hongpeng-guo · 2025-03-13T21:35:53Z

python/ray/train/v2/_internal/state/export.py

 )

-TRAIN_SCHEMA_VERSION = 0
+TRAIN_SCHEMA_VERSION = 1


How do we decide when to update this value? Maybe add a comment above it explaining when to increment it, i.e., when files x, y, z are updated. Or is it just a differentiator between Train V1 and V2?
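One way to document a version constant like this is sketched below. The bump rule in the comment is an assumed convention, not something this PR confirms, and `SCHEMA_VERSION`, `stamp`, and `check_version` are hypothetical names used only for illustration.

```python
# Hypothetical convention (not confirmed by this PR): increment the version
# whenever a field in the exported record is added, removed, or changes meaning.
SCHEMA_VERSION = 1


def stamp(record: dict) -> dict:
    """Return a copy of the record tagged with the current schema version."""
    return {**record, "schema_version": SCHEMA_VERSION}


def check_version(record: dict) -> int:
    """Return the record's version, rejecting versions newer than this reader."""
    version = record.get("schema_version", 0)
    if version > SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {version}")
    return version


stamped = stamp({"name": "example_run"})
```

A comment of this kind next to the constant tells future contributors whether a given change requires a bump, which is the gap the question points at.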

Signed-off-by: Matthew Deng <[email protected]>

hongpeng-guo

Nice! LGTM in general.

Just double-checking whether any logs might be missing that only appear in the worker .out files, e.g., the UDF print() output.

hongpeng-guo · 2025-03-13T21:44:10Z

python/ray/train/_internal/backend_executor.py


            core_context = ray.runtime_context.get_runtime_context()
+            controller_log_file_path = (
+                ray._private.worker.global_worker.get_err_file_path()


I think by default the Python loggers write to the process .err file. But it would be better to double-check whether we may lose anything that is unique to the .out file.

I think the UDF print function may write to the .out file of the worker process.

Yep, this is just best effort.
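The stdout/stderr split discussed in this thread can be reproduced with the standard library alone. The sketch below involves no Ray code and `capture_streams` is a hypothetical helper. It shows that a `logging.StreamHandler` created without arguments writes to stderr, while `print()` writes to stdout.

```python
import contextlib
import io
import logging


def capture_streams():
    """Return (stdout_text, stderr_text) from one print() and one log call."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        # StreamHandler() binds to sys.stderr when it is constructed,
        # so it is created inside the redirect.
        handler = logging.StreamHandler()
        logger = logging.getLogger("stream_demo")
        logger.setLevel(logging.INFO)
        logger.propagate = False  # keep root handlers out of the demo
        logger.addHandler(handler)
        try:
            print("from print")
            logger.info("from logger")
        finally:
            logger.removeHandler(handler)
    return out.getvalue(), err.getvalue()


stdout_text, stderr_text = capture_streams()
```

A log path that points only at the stderr file would therefore miss anything user code emits with `print()`, which matches the "best effort" caveat above.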

nikitavemuri

looks good!


matthewdeng added 6 commits March 7, 2025 19:00

[train] train v1 export api

17e68ff

Signed-off-by: Matthew Deng <[email protected]>

test

8ebb6c4

Signed-off-by: Matthew Deng <[email protected]>

Merge branch 'master' of github.com:ray-project/ray into v2/backport

50c83ec

Signed-off-by: Matthew Deng <[email protected]>

fix

5f86934

Signed-off-by: Matthew Deng <[email protected]>

fix

9ba0d6e

Signed-off-by: Matthew Deng <[email protected]>

fix

9a227c6

Signed-off-by: Matthew Deng <[email protected]>

matthewdeng assigned nikitavemuri and hongpeng-guo Mar 13, 2025

matthewdeng requested a review from nikitavemuri March 13, 2025 21:00

matthewdeng marked this pull request as ready for review March 13, 2025 21:01

matthewdeng requested review from hongpeng-guo, justinvyu, raulchen and woshiyyya as code owners March 13, 2025 21:01

hongpeng-guo reviewed Mar 13, 2025


python/ray/train/_internal/backend_executor.py (resolved)

hongpeng-guo reviewed Mar 13, 2025


python/ray/train/_internal/state/state_manager.py (resolved)

hongpeng-guo reviewed Mar 13, 2025


BUILD

9b9325d

Signed-off-by: Matthew Deng <[email protected]>

hongpeng-guo approved these changes Mar 13, 2025


hongpeng-guo reviewed Mar 13, 2025


matthewdeng added the `go` label (add ONLY when ready to merge, run all tests) Mar 13, 2025

nikitavemuri approved these changes Mar 14, 2025


matthewdeng merged commit 31878c9 into ray-project:master Mar 14, 2025
6 checks passed

matthewdeng deleted the v2/backport branch March 14, 2025 00:32

hainesmichaelc added the community-backlog label May 22, 2025
