[feat] Add benchmark tools #357
Conversation
Thanks for this @reciprocated! This mostly looks good to me, aside from a few suggestions; however, when I tried to run it I got:
File "/mnt/nvme/home/uwu/conda/nemo-113/lib/python3.8/site-packages/wandb/sdk/wandb_require_helpers.py", line 37, in __init__
self._check_if_requirements_met()
File "/mnt/nvme/home/uwu/conda/nemo-113/lib/python3.8/site-packages/wandb/sdk/wandb_require_helpers.py", line 48, in _check_if_requirements_met
raise Exception(
Exception: You must explicitly enable this feature with `wandb.require("report-editing:v0")`
@@ -0,0 +1,45 @@
From 7e7144868f71f437e10a15868b99d2cfcc571f3e Mon Sep 17 00:00:00 2001
maybe we should hide this file away in a subfolder
I plan to delete this altogether right before this pr is merged, since it just patches other branches with a subset of changes this pr introduces
got it, that makes sense
rm -rf ../benchmark_logs && mkdir ../benchmark_logs
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 --config_file configs/accelerate/zero2-bf16.yaml --main_process_port 8880 examples/ppo_sentiments.py "$args" > ../benchmark_logs/ppo_sentiments.log 2>&1 &
does it happen to change the timings at all if you run all of them together on the same machine vs one by one?
nothing noticeable: https://wandb.ai/sorry/trlx/reports/Timing-difference--VmlldzozODA4MTA4
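For reference, a minimal sketch of the two setups being compared, assuming a multi-GPU machine and example script names from the repo (an illustration, not the PR's actual benchmark script):

# one by one: each example runs alone on GPU 0
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 examples/ppo_sentiments.py > ppo_seq.log 2>&1
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 examples/sft_sentiments.py > sft_seq.log 2>&1

# together: each example pinned to its own GPU, launched concurrently
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 --main_process_port 8880 examples/ppo_sentiments.py > ppo_par.log 2>&1 &
CUDA_VISIBLE_DEVICES=1 accelerate launch --num_processes 1 --main_process_port 8881 examples/sft_sentiments.py > sft_par.log 2>&1 &
wait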
pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
pip install -e .
args='{"train": {"project_name": "trlx-references", "entity_name": '$entity', "tags": ["'$hash'"]}}'
Maybe worth adding the GPU name to the tags (interconnect would be cool too if there's some easy way to get that)
It's already logged under "System Hardware" in W&B as GPU type: NVIDIA A100-SXM4-80GB, or do you want to see it specifically as a tag?
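If a tag is wanted anyway, a hedged sketch of how the GPU name could be appended next to $hash using nvidia-smi (the gpu_name variable is an illustration, not part of the PR):

gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1 | tr ' ' '-')   # e.g. NVIDIA-A100-SXM4-80GB
args='{"train": {"project_name": "trlx-references", "entity_name": '$entity', "tags": ["'$hash'", "'$gpu_name'"]}}'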
scripts/benchmark.sh (Outdated)
python -m venv venv
. venv/bin/activate
pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
is there some way to do this without mutating the environment, e.g. using conda create to make a fresh env? since this can break stuff you had installed that depends on one version of torch. Also it would probably be good to pin a torch version and tag it? Or otherwise save the pip freeze output somewhere in the run's metadata.
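A rough sketch of that suggestion, assuming conda is available; the env name, torch pin, and freeze path are placeholders rather than anything the PR actually does:

conda create -y -n trlx-bench python=3.10                     # fresh env instead of mutating the current one
conda run -n trlx-bench python -m pip install --upgrade pip
conda run -n trlx-bench pip install "torch==2.0.1" --extra-index-url https://download.pytorch.org/whl/cu117   # pinned version
conda run -n trlx-bench pip install -e .
conda run -n trlx-bench pip freeze > ../benchmark_logs/pip-freeze.txt   # save the resolved deps alongside the run's metadata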
It also failed to install for me, with:
Using cached https://files.pythonhosted.org/packages/bc/c3/f068337a370801f372f2f8f6bad74a5c140f6fda3d9de154052708dd3c65/Jinja2-3.1.2-py3-none-any.whl
Collecting triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64" (from torch)
ERROR: Could not find a version that satisfies the requirement triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64" (from torch) (from versions: 0.4.1, 0.4.2)
ERROR: No matching distribution found for triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64" (from torch)
This setup directly reflects the installation instructions from https://github.com/CarperAI/trlx#readme. However, I wouldn't mind pinning every dependency down, both here and there. Also this is odd, because I can install triton==2.0.0 on uname -i == x86_64; moreover, we probably use the same hardware here.
seems to work if i add a python -m pip install pip --upgrade right before installing torch in benchmark.sh; the pip in the created venv was too old.
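A minimal sketch of the resulting order in benchmark.sh with that fix applied (the exact placement in the merged script may differ):

python -m venv venv
. venv/bin/activate
python -m pip install --upgrade pip   # the venv's bundled pip was too old to resolve torch's triton==2.0.0 requirement
pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
pip install -e .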
@@ -0,0 +1,63 @@
#!/bin/bash
maybe add set args to make the script die on error
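For example (the merged script appears to use set -ex; adding -u and pipefail is extra strictness beyond what the PR does):

set -euxo pipefail   # exit on errors and unset variables, echo commands, fail on any error inside a pipeline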
@cat-state you must be using some older wandb version
Thanks, i was able to run it after updating wandb. I think this would be good to merge after adding the pip upgrade and the $dir check. I also still get

wandb.errors.CommError: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup. (Error 403: Forbidden)

even though I have a trlx-references project in my wandb.
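For anyone hitting the same report-editing exception, the fix on the reviewer's side was simply updating wandb (no minimum version is pinned in the thread):

pip install --upgrade wandb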
set -ex
dir=`mktemp -d -p .`
cd $dir
trap "rm -rf ../$dir" EXIT
is there any case where $dir could be null/empty string? might be good to add a check, as otherwise you could accidentally rm -rf the containing directory
i don't think there is, but an additional check wouldn't hurt either
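A hedged sketch of what such a guard could look like (the PR itself keeps the trap shown above):

dir=$(mktemp -d -p .) || exit 1
[ -n "$dir" ] || exit 1          # refuse to continue with an empty $dir so the trap can't rm -rf the parent directory
cd "$dir" || exit 1
trap 'rm -rf "../$dir"' EXIT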
branch=main
entity=null
only_hash=false
only_tiny=false
there's no way to set this via references.py
Also would be good to add a link to the trlx-references wandb in the readme
Got it to work after changing my default entity from carperai to myself, thank you!
This PR adds tools to cross-reference example runs from different branches.

As an example, the following command will git clone the add-benchmark-tools branch and run a set of examples on it, storing them under the trlx-references project name under a private username. It will also run the same set of examples on main by default, unless they were already present in the trlx-references project, and make a W&B report comparing the two sets. Example of a report: https://wandb.ai/sorry/trlx-references/reports/add-benchmark-tools-v-main--VmlldzozOTAwMTQy

Presence of examples' runs is determined by a tags config field which contains a hash of all files (excluding markdown). If runs are made on top of some PR, then after the PR is merged into main the total hash of files will not change (unlike the git commit hash), so the runs won't have to be remade.

By default runs are referenced against CarperAI/trlx's main, but optionally they can be referenced against any branch.
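As a hedged illustration of the idea, a content hash over non-markdown files could be computed along these lines (the PR may derive it differently, e.g. only over git-tracked files):

hash=$(find . -type f ! -name '*.md' ! -path './.git/*' -print0 | sort -z | xargs -0 sha256sum | sha256sum | cut -d ' ' -f 1)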