Monitor comm cost #40
Conversation
fedgraph/federated_methods.py
Outdated
    if args.use_cluster:
        monitor = monitor or Monitor()
        monitor.init_time_end()
Reviewer: The monitor should only be initialized in one place; if monitor is missing, throw an error and print the error info.
Author: Yeah, that's right. I did this to make sure monitor is always defined when use_cluster is true, in case it isn't passed correctly. I'll throw an error and print the error info instead.
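For illustration, here is a minimal sketch of such a guard; require_monitor is a hypothetical helper name, not code from this PR:

    def require_monitor(args, monitor):
        """Fail loudly instead of silently creating a second Monitor."""
        if args.use_cluster and monitor is None:
            # Hypothetical error message; the point is to surface the misuse early.
            raise RuntimeError(
                "use_cluster=True but no Monitor instance was provided; "
                "Monitor must be initialized in exactly one place."
            )
        return monitor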
fedgraph/federated_methods.py
Outdated
    # Append paths relative to the current script's directory
    # Use monitor passed from run_fedgraph
    if args.use_cluster:
        monitor = args.monitor
Reviewer: monitor is an instance, not something that belongs on args.
Author: I tried this because running the project locally raised errors when passing monitor; I'll change it back to the original version if we will only run the project on a cluster in the future. I also found that we repeatedly initialize Monitor in some parts of the original code: helper functions such as run_GC_selftrain, run_GC_Fed_algorithm, and run_GCFL_algorithm still create a new Monitor instance each time. Should we change this?
fedgraph/federated_methods.py
Outdated
    run_LP(args)

    # End total communication timing
    monitor.total_comm_time_end()
Reviewer: No idea why we need total_comm_time.
Author: OK, I've already deleted all the code related to total_comm_time. I think I misunderstood what you required. For now, I just measure the initialization time for each of run_NC, run_GC, and run_LP.
fedgraph/federated_methods.py
Outdated
    monitor.init_time_start()
    monitor.total_comm_time_start()

    args.monitor = monitor
Reviewer: Don't pass it through args this way.
Author: I'll just delete it and change it back.
distributed_job.py
Outdated
    ray.init(address="auto")
Reviewer: Delete this file.
Author: Already deleted.
fedgraph/federated_methods.py
Outdated
    - if server.use_cluster:
    -     monitor = Monitor()
    + # Use monitor from server
    + monitor = server.monitor if hasattr(server, "monitor") else Monitor()
Reviewer: Only initialize it in one place.
Author: Got it!
… tracking from cluster-specific metrics.
      if args.fedgraph_task == "NC":
    -     run_NC(args, data)
    +     run_NC(args, data, monitor)
      elif args.fedgraph_task == "GC":
    -     run_GC(args, data)
    +     run_GC(args, data, monitor)
      elif args.fedgraph_task == "LP":
    -     run_LP(args)
    +     run_LP(args, monitor)
Reviewer: Remove this; always initialize it inside the function.
Author: Yeah, that makes sense, thanks for the advice. Already fixed.
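For reference, a minimal sketch of the agreed pattern, with the Monitor created inside the task entry point so it can never be None; the import path and the comments in the body are assumptions, not the PR's exact code:

    from fedgraph.monitor_class import Monitor  # assumed import path

    def run_NC(args, data):
        # Single place of initialization for this task; callers no longer pass a monitor in.
        monitor = Monitor()
        monitor.init_time_start()
        # ... construct the server and trainers, load data onto the cluster ...
        monitor.init_time_end()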
      [trainer.relabel_adj.remote() for trainer in server.trainers]
    - if args.use_cluster:
    -     monitor.pretrain_time_end(30)
    -     monitor.train_time_start()
    + monitor.pretrain_time_end(30)
    + monitor.train_time_start()
      #######################################################################
Reviewer: We should only sleep 30s when use_cluster is set.
    - if server.use_cluster:
    + if monitor is not None:
          monitor.train_time_end(30)
      fs = frame.style.apply(highlight_max).data
Reviewer: Initialize it inside the function so it won't be None; there are too many if checks.
      current_dir = os.path.dirname(os.path.abspath(__file__))
      ray.init()
    - if args.use_cluster:
    -     # Initialize monitor and start tracking initialization time
    -     monitor = Monitor()
    + if args.use_cluster and monitor is not None:
          monitor.init_time_start()

      # Append paths relative to the current script's directory
Reviewer: Same issue: this still uses args.use_cluster.
    - monitor.pretrain_time_end(30)
    + monitor.pretrain_time_end(30 if args.use_cluster else 0)
      monitor.train_time_start()
Reviewer: Too hacky; could you move this if/else inside the monitor?
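A rough sketch of what moving that branch inside the monitor could look like; this is a simplified Monitor with only the timing methods from this thread, and the use_cluster constructor flag is an assumption:

    import time

    class Monitor:
        def __init__(self, use_cluster: bool = False):
            self.use_cluster = use_cluster
            self._pretrain_start = None

        def pretrain_time_start(self):
            self._pretrain_start = time.time()

        def pretrain_time_end(self, sleep_s: int = 30):
            elapsed = time.time() - self._pretrain_start
            # Sleep only when running on a cluster, so call sites stay a plain
            # monitor.pretrain_time_end(30) with no branching.
            if self.use_cluster:
                time.sleep(sleep_s)
            print(f"Pretrain time: {elapsed:.2f}s")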
fedgraph/federated_methods.py
Outdated
    args = kwds.get("args", {})
    self.use_encryption = (
        getattr(args, "use_encryption", False)
        if hasattr(args, "use_encryption")
Reviewer: This can lead to confusion, as the original args are lost after reassignment.
Author: Oh, thanks for the reminder. I've renamed the local variable to args_obj to avoid shadowing *args, which keeps this part clear and safe.
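Roughly what the rename could look like; the wrapper class name is a placeholder, and only the shadowing fix is the point:

    class TrainerWrapper:  # placeholder name; only the shadowing fix matters here
        def __init__(self, *args, **kwds):
            # Renamed so the positional *args tuple above is not shadowed.
            args_obj = kwds.get("args", None)
            # getattr with a default also covers the missing-attribute case,
            # so the separate hasattr check can be dropped.
            self.use_encryption = getattr(args_obj, "use_encryption", False)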
    kind: ClusterConfig

    metadata:
      name: mlarge-1739510276
      region: us-east-1

    nodeGroups:
      - name: head-nodes
        instanceType: m5.24xlarge
        desiredCapacity: 1
        minSize: 0
        maxSize: 1
        volumeSize: 256
        labels:
          ray-node-type: head

      - name: worker-nodes
        instanceType: m5.16xlarge
        desiredCapacity: 10
        minSize: 10
        maxSize: 10
Reviewer: Why is there a .bak file?
Author: The .bak file is a backup automatically created by Ray when updating the EKS cluster config, to prevent accidental loss. It's safe to ignore or delete if version control is used.
setup_cluster.sh
Outdated
| "pip": ["fsspec==2023.6.0", "huggingface_hub", "tenseal"] | ||
| }' \ |
Reviewer: "fsspec==2023.6.0" may be too strict.
Author: Yeah, I think we can relax the version constraint to just "fsspec" to avoid potential compatibility issues.
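A sketch of the relaxed runtime environment, written as the equivalent ray.init call; the package list mirrors setup_cluster.sh, and dropping the pin entirely is the proposal above, not the current script:

    import ray

    ray.init(
        address="auto",
        runtime_env={
            # unpinned fsspec; the other packages are unchanged from setup_cluster.sh
            "pip": ["fsspec", "huggingface_hub", "tenseal"]
        },
    )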
Ryan-YuanLi left a comment: approve
    print(
        f"[Debug] Trainer running on node IP: {ray.util.get_node_ip_address()}"
    )

    clients = [
Reviewer: Just wondering, are there cases where trainers could run on the same IP because Ray schedules them onto the same pod?
Author: Yes, it's possible: if the trainers' resource demands are small and Ray schedules multiple trainers onto the same node or pod, they will share the same IP.
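If distinct IPs matter for the measurements, one option is Ray's best-effort SPREAD scheduling, sketched below; the Trainer class here is a stand-in, not the project's trainer:

    import ray

    ray.init(address="auto")

    @ray.remote(num_cpus=1, scheduling_strategy="SPREAD")  # best-effort spread across nodes
    class Trainer:  # stand-in for the project's trainer actor
        def node_ip(self) -> str:
            return ray.util.get_node_ip_address()

    trainers = [Trainer.remote() for _ in range(4)]
    print(ray.get([t.node_ip.remote() for t in trainers]))  # ideally distinct node IPs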
Add communication monitoring during initialization in NC, GC, and LP.