Commit 55c4efa

[operator] fix TPU multi-host RayJob and RayCluster samples (#3733)
TPU v6e multihost requires at least `jax[tpu]==0.4.33`. That release doesn't support py39, which is the Python version used in `rayproject/ray:2.46.0`, so we use `rayproject/ray:2.46.0-py310` instead.

Signed-off-by: David Xia <[email protected]>
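The constraint in the commit message can be checked before a job tries to install JAX. This is a minimal sketch, assuming the pinned `jax[tpu]` release needs Python 3.10 or newer (the message itself says only that py39 is unsupported). `MIN_PYTHON` and `interpreter_supports_jax` are hypothetical names, not part of KubeRay, Ray, or JAX.

```python
import sys

# Assumption: the pinned jax[tpu] release needs Python 3.10 or newer.
MIN_PYTHON = (3, 10)


def interpreter_supports_jax(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) interpreter version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor; patch level does not matter here.
    return tuple(version_info[:2]) >= tuple(minimum)


print(interpreter_supports_jax((3, 9, 0)))   # False: the default 2.46.0 image
print(interpreter_supports_jax((3, 10, 0)))  # True: the -py310 image
```

Calling `interpreter_supports_jax()` with no arguments inspects the running interpreter, which may help a job fail fast with a clear message instead of a pip resolution error.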
1 parent 564524c commit 55c4efa

7 files changed, 16 additions and 18 deletions

ray-operator/config/samples/ray-cluster.tpu-v4-multihost.yaml

Lines changed: 2 additions & 2 deletions
@@ -13,7 +13,7 @@ spec:
       spec:
         containers:
         - name: ray-head
-          image: rayproject/ray:2.46.0
+          image: rayproject/ray:2.46.0-py310
           imagePullPolicy: IfNotPresent
           resources:
             limits:
@@ -57,7 +57,7 @@ spec:
       spec:
         containers:
         - name: ray-worker
-          image: rayproject/ray:2.46.0
+          image: rayproject/ray:2.46.0-py310
           imagePullPolicy: IfNotPresent
           resources:
             limits:

ray-operator/config/samples/ray-cluster.tpu-v6e-16-multihost.yaml

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ spec:
       spec:
         containers:
         - name: ray-head
-          image: rayproject/ray:2.46.0
+          image: rayproject/ray:2.46.0-py310
           imagePullPolicy: IfNotPresent
           resources:
             limits:
@@ -41,7 +41,7 @@ spec:
       spec:
         containers:
         - name: ray-worker
-          image: rayproject/ray:2.46.0
+          image: rayproject/ray:2.46.0-py310
           imagePullPolicy: IfNotPresent
           resources:
             limits:

ray-operator/config/samples/ray-cluster.tpu-v6e-256-multihost.yaml

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ spec:
       spec:
         containers:
         - name: ray-head
-          image: rayproject/ray:2.46.0
+          image: rayproject/ray:2.46.0-py310
           imagePullPolicy: IfNotPresent
           resources:
             limits:
@@ -41,7 +41,7 @@ spec:
       spec:
         containers:
         - name: ray-worker
-          image: rayproject/ray:2.46.0
+          image: rayproject/ray:2.46.0-py310
           imagePullPolicy: IfNotPresent
           resources:
             limits:

ray-operator/config/samples/ray-job.tpu-v6e-16-multihost.yaml

Lines changed: 3 additions & 3 deletions
@@ -7,7 +7,7 @@ spec:
   runtimeEnvYAML: |
     working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
     pip:
-      - jax[tpu]==0.4.33
+      - jax[tpu]==0.6.1
       - -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
   rayClusterSpec:
     rayVersion: '2.46.0'
@@ -17,7 +17,7 @@ spec:
         spec:
           containers:
           - name: ray-head
-            image: rayproject/ray:2.46.0
+            image: rayproject/ray:2.46.0-py310
             ports:
             - containerPort: 6379
               name: gcs-server
@@ -47,7 +47,7 @@ spec:
             runAsUser: 0
           containers:
           - name: ray-worker
-            image: rayproject/ray:2.46.0
+            image: rayproject/ray:2.46.0-py310
             resources:
               limits:
                 cpu: "24"
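For readers who drive the same job from Python instead of a RayJob manifest, the `runtimeEnvYAML` block in this file's first hunk can be written as a plain dictionary. This is a sketch only: `build_runtime_env` is a hypothetical helper, and handing the result to Ray as a runtime environment is an assumption about usage, not something this commit does.

```python
import json

# Values copied from the runtimeEnvYAML block in the sample manifest.
WORKING_DIR = "https://github.com/ray-project/kuberay/archive/master.zip"
LIBTPU_INDEX = "https://storage.googleapis.com/jax-releases/libtpu_releases.html"


def build_runtime_env(jax_version="0.6.1"):
    """Mirror the sample's runtimeEnvYAML block as a plain dict (hypothetical helper)."""
    return {
        "working_dir": WORKING_DIR,
        "pip": [
            f"jax[tpu]=={jax_version}",  # the pin this commit bumps from 0.4.33
            f"-f {LIBTPU_INDEX}",        # extra index for libtpu wheels
        ],
    }


print(json.dumps(build_runtime_env(), indent=2))
```

Keeping the JAX version as a parameter puts the pin in one place, so the next bump needs only one change.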

ray-operator/config/samples/ray-job.tpu-v6e-256-multihost.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,7 +17,7 @@ spec:
         spec:
           containers:
           - name: ray-head
-            image: rayproject/ray:2.46.0
+            image: rayproject/ray:2.46.0-py310
             ports:
             - containerPort: 6379
               name: gcs-server
@@ -46,7 +46,7 @@ spec:
         spec:
           containers:
           - name: ray-worker
-            image: rayproject/ray:2.46.0
+            image: rayproject/ray:2.46.0-py310
             resources:
               limits:
                 cpu: "24"

ray-operator/config/samples/ray-job.tpu-v6e-singlehost.yaml

Lines changed: 3 additions & 3 deletions
@@ -7,7 +7,7 @@ spec:
   runtimeEnvYAML: |
     working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
     pip:
-      - jax[tpu]==0.4.33
+      - jax[tpu]==0.6.1
       - -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
   rayClusterSpec:
     rayVersion: '2.46.0'
@@ -17,7 +17,7 @@ spec:
         spec:
           containers:
           - name: ray-head
-            image: rayproject/ray:2.46.0
+            image: rayproject/ray:2.46.0-py310
             ports:
             - containerPort: 6379
               name: gcs-server
@@ -45,7 +45,7 @@ spec:
             runAsUser: 0
           containers:
           - name: ray-worker
-            image: rayproject/ray:2.46.0
+            image: rayproject/ray:2.46.0-py310
             resources:
               limits:
                 cpu: "24"

ray-operator/config/samples/tpu/tpu_list_devices.py

Lines changed: 2 additions & 4 deletions
@@ -1,7 +1,6 @@
 import os
 import ray
 import jax
-import time
 
 from jax.experimental import multihost_utils
 
@@ -10,9 +9,8 @@
 @ray.remote(resources={"TPU": 4})
 def tpu_cores():
     multihost_utils.sync_global_devices("sync")
-    cores = "TPU cores:" + str(jax.device_count())
-    print("TPU Worker: " + os.environ.get("TPU_WORKER_ID"))
-    return cores
+    print(f"TPU Worker: {os.environ.get('TPU_WORKER_ID')}")
+    return f"TPU cores: {jax.device_count()}"
 
 num_workers = int(ray.available_resources()["TPU"]) // 4
 print(f"Number of TPU Workers: {num_workers}")
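The move to f-strings in `tpu_cores` also changes what happens when `TPU_WORKER_ID` is unset. `os.environ.get` returns `None`, and concatenating `None` to a `str` raises `TypeError`, whereas an f-string renders it as the text `None`. The sketch below isolates that difference along with the worker-count arithmetic. `describe_worker_concat`, `describe_worker_fstring`, and `count_workers` are hypothetical names, and the environment is passed in as a plain dict so the example runs without Ray, JAX, or a TPU.

```python
def describe_worker_concat(env):
    # Old style: raises TypeError when the variable is missing (str + None).
    return "TPU Worker: " + env.get("TPU_WORKER_ID")


def describe_worker_fstring(env):
    # New style: a missing variable is rendered as the text "None".
    return f"TPU Worker: {env.get('TPU_WORKER_ID')}"


def count_workers(available_resources, tpus_per_worker=4):
    # Mirrors int(ray.available_resources()["TPU"]) // 4 from the sample;
    # Ray reports resource quantities as floats, hence the int() conversion.
    return int(available_resources["TPU"]) // tpus_per_worker


print(describe_worker_fstring({"TPU_WORKER_ID": "0"}))  # TPU Worker: 0
print(describe_worker_fstring({}))                      # TPU Worker: None
print(count_workers({"TPU": 16.0}))                     # 4
```

The divisor matches the `resources={"TPU": 4}` request on the remote task, so each counted worker corresponds to one schedulable `tpu_cores` call.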
