
Conversation


@yy462 yy462 commented Apr 3, 2025

Add communication monitoring during initialization in NC, GC, and LP.

@yy462 yy462 requested review from Ryan-YuanLi and yh-yao April 3, 2025 02:07
Comment on lines 1088 to 1091
if args.use_cluster:
    monitor = monitor or Monitor()
    monitor.init_time_end()

Contributor

The monitor should only be initialized in one place; if monitor is None, throw an error and print the error info.

Contributor Author

Yeah, that's correct. The reason I did this is that I wanted to ensure monitor is always defined when use_cluster is true, in case it isn't passed correctly.
I'll throw an error and print the error info instead.
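A minimal sketch of that guard, following the snippet above (the exact placement inside run_NC/run_GC/run_LP may differ):

```python
if args.use_cluster:
    # Fail fast instead of silently creating a second Monitor here.
    if monitor is None:
        raise ValueError(
            "use_cluster is True but no Monitor instance was passed in; "
            "the monitor must be initialized in exactly one place."
        )
    monitor.init_time_end()
```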

# Append paths relative to the current script's directory
# Use monitor passed from run_fedgraph
if args.use_cluster:
    monitor = args.monitor
Contributor

monitor is an instance, not an args attribute.

Contributor Author

@yy462 yy462 Apr 4, 2025

The reason I tried this is that when we run the project locally, passing the monitor causes some errors; I'll change it back to the original version if we will only run the project on a cluster in the future. I also noticed that we repeatedly initialize Monitor in parts of the original code: helper functions such as run_GC_selftrain, run_GC_Fed_algorithm, and run_GCFL_algorithm still create a new Monitor instance each time. Should we change that as well?
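One possible shape for that cleanup (a sketch only; the real helper signatures differ and the monitor keyword is hypothetical): create the Monitor once per task entry point and reuse the same instance in every helper instead of calling Monitor() again inside them.

```python
# Sketch: a single Monitor per task entry point, threaded through the helpers.
monitor = Monitor()
monitor.init_time_start()

# Hypothetical keyword argument; today these helpers construct their own Monitor.
run_GC_selftrain(..., monitor=monitor)
run_GC_Fed_algorithm(..., monitor=monitor)
run_GCFL_algorithm(..., monitor=monitor)
```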

run_LP(args)

# End total communication timing
monitor.total_comm_time_end()
Contributor

no idea why we need total_comm_time

Contributor Author

OK, I've deleted all the code related to total_comm_time; I think I misunderstood what you required. For now, I just measure the initialization time for each of run_NC, run_GC, and run_LP.
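For reference, that per-task timing only needs a pair of timestamps. A self-contained toy sketch of what such methods can look like (not the actual fedgraph Monitor implementation):

```python
import time
from typing import Optional


class InitTimer:
    """Toy stand-in for the init-timing part of Monitor (illustration only)."""

    def __init__(self) -> None:
        self._init_start: Optional[float] = None
        self.init_seconds: Optional[float] = None

    def init_time_start(self) -> None:
        # Record the wall-clock start of initialization.
        self._init_start = time.perf_counter()

    def init_time_end(self) -> None:
        # Store the elapsed initialization time in seconds.
        assert self._init_start is not None, "init_time_start() was never called"
        self.init_seconds = time.perf_counter() - self._init_start


timer = InitTimer()
timer.init_time_start()
# ... build server, trainers, load data ...
timer.init_time_end()
print(f"Initialization took {timer.init_seconds:.2f}s")
```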

monitor.init_time_start()
monitor.total_comm_time_start()

args.monitor = monitor
Contributor

Don't pass the monitor through args this way.

Contributor Author

Just deleted it and changed it back.


ray.init(address="auto")


Contributor

delete this file

Contributor Author

Already deleted.

  if server.use_cluster:
-     monitor = Monitor()
+     # Use monitor from server
+     monitor = server.monitor if hasattr(server, "monitor") else Monitor()
Contributor

Only initialize it in one place.

Contributor Author

Got it!

Comment on lines 62 to 68
  if args.fedgraph_task == "NC":
-     run_NC(args, data)
+     run_NC(args, data, monitor)
  elif args.fedgraph_task == "GC":
-     run_GC(args, data)
+     run_GC(args, data, monitor)
  elif args.fedgraph_task == "LP":
-     run_LP(args)
+     run_LP(args, monitor)

Contributor

Remove this; always init inside the function.

Contributor Author

@yy462 yy462 Apr 17, 2025

Yeah, that makes sense, thanks for the advice. Already fixed it.
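The resulting pattern, roughly (simplified signature; the real run_NC takes more parameters): the dispatcher just calls run_NC(args, data), and the Monitor is created unconditionally inside the task function, so later code never has to check for None.

```python
# Sketch of the agreed pattern, not the exact fedgraph code.
def run_NC(args, data):
    monitor = Monitor()          # single point of initialization
    monitor.init_time_start()
    # ... set up server and trainers ...
    monitor.init_time_end()
```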

Comment on lines 305 to 309
  [trainer.relabel_adj.remote() for trainer in server.trainers]
- if args.use_cluster:
-     monitor.pretrain_time_end(30)
-     monitor.train_time_start()
+ monitor.pretrain_time_end(30)
+ monitor.train_time_start()
  #######################################################################
Contributor

We should only sleep 30s when use_cluster is true.

Comment on lines 750 to 753

if server.use_cluster:
    if monitor is not None:
        monitor.train_time_end(30)
fs = frame.style.apply(highlight_max).data
Contributor

Init inside the function so it won't be None; there are too many if checks.

Comment on lines 1032 to 1037
  current_dir = os.path.dirname(os.path.abspath(__file__))
  ray.init()
- if args.use_cluster:
-     # Initialize monitor and start tracking initialization time
-     monitor = Monitor()
+ if args.use_cluster and monitor is not None:
      monitor.init_time_start()

  # Append paths relative to the current script's directory
Contributor

Same issue: this still uses args.use_cluster.

Comment on lines 302 to 304

- monitor.pretrain_time_end(30)
+ monitor.pretrain_time_end(30 if args.use_cluster else 0)
  monitor.train_time_start()
Contributor

Too hacky; could you move this if/else inside the monitor?
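One way to push that branch into the monitor itself (a sketch, assuming Monitor keeps a use_cluster flag; the real class has more state and methods): let pretrain_time_end decide whether to sleep, so call sites stay unconditional.

```python
import time


class Monitor:
    # Sketch only: the real fedgraph Monitor looks different.
    def __init__(self, use_cluster: bool = False) -> None:
        self.use_cluster = use_cluster
        self._pretrain_start = time.perf_counter()

    def pretrain_time_end(self, sleep_seconds: int = 30) -> float:
        # Only wait for cluster metrics to flush when actually running on a cluster.
        if self.use_cluster:
            time.sleep(sleep_seconds)
        return time.perf_counter() - self._pretrain_start
```

Call sites then keep a plain monitor.pretrain_time_end(30) with no ternary.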

Comment on lines 147 to 150
args = kwds.get("args", {})
self.use_encryption = (
    getattr(args, "use_encryption", False)
    if hasattr(args, "use_encryption")
Contributor

can lead to confusion, as the original args are lost after reassignment

Contributor Author

Oh, thanks for the reminder. I've renamed the local variable to args_obj to avoid shadowing *args, which keeps this part clear and safe.
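The rename looks roughly like this (a simplified sketch, not the exact lines; getattr with a default already covers the missing-attribute case):

```python
def __init__(self, *args, **kwds):
    # Distinct local name so the *args tuple above is not shadowed.
    args_obj = kwds.get("args", None)
    self.use_encryption = getattr(args_obj, "use_encryption", False)
```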

Comment on lines 2 to 22
kind: ClusterConfig

metadata:
  name: mlarge-1739510276
  region: us-east-1

nodeGroups:
  - name: head-nodes
    instanceType: m5.24xlarge
    desiredCapacity: 1
    minSize: 0
    maxSize: 1
    volumeSize: 256
    labels:
      ray-node-type: head

  - name: worker-nodes
    instanceType: m5.16xlarge
    desiredCapacity: 10
    minSize: 10
    maxSize: 10
Contributor

Why is there a .bak file?

Contributor Author

The .bak file is a backup automatically created by Ray when updating the EKS cluster config to prevent accidental loss. It’s safe to ignore or delete if version control is used.

setup_cluster.sh (outdated)
Comment on lines 144 to 145
"pip": ["fsspec==2023.6.0", "huggingface_hub", "tenseal"]
}' \
Contributor

"fsspec==2023.6.0" may be too strict

Contributor Author

Yeah, I think we can relax the version constraint to just "fsspec" to avoid potential compatibility issues.
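A Python-side equivalent of that runtime environment with the pin relaxed (a sketch; setup_cluster.sh passes the same list as JSON):

```python
import ray

# Unpinned fsspec; Ray installs whatever compatible version pip resolves.
ray.init(
    address="auto",
    runtime_env={"pip": ["fsspec", "huggingface_hub", "tenseal"]},
)
```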

Contributor

@Ryan-YuanLi Ryan-YuanLi left a comment

approve

Comment on lines +1063 to 1067
print(
    f"[Debug] Trainer running on node IP: {ray.util.get_node_ip_address()}"
)

clients = [
Contributor

Just wondering: are there cases where trainers could end up on the same IP because Ray schedules them onto the same pod?

Contributor Author

Yes, it's possible: if the trainers' resource demands are small and Ray schedules multiple trainers onto the same node or pod, they will share the same IP.
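If one-trainer-per-node ever needs to be enforced, a placement group with the STRICT_SPREAD strategy is one way to do it (a sketch using Ray's public API; Trainer here is a stand-in for the actual trainer actor, and ray.init(address="auto") is assumed to have run already):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote
class Trainer:  # stand-in for the real trainer actor
    def node_ip(self) -> str:
        return ray.util.get_node_ip_address()


num_trainers = 4
# STRICT_SPREAD places each bundle on a different node, so trainers cannot share an IP.
pg = placement_group([{"CPU": 1}] * num_trainers, strategy="STRICT_SPREAD")
ray.get(pg.ready())  # blocks until every bundle is reserved

trainers = [
    Trainer.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(num_trainers)
]
print(ray.get([t.node_ip.remote() for t in trainers]))
```

Without such a constraint, Ray's default scheduling is free to pack multiple small trainers onto one node.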

@yy462 yy462 merged commit 96ed93d into main May 22, 2025
2 checks passed
@yh-yao yh-yao deleted the monitor-comm-cost branch September 18, 2025 20:17