Conversation


@maxreciprocate commented Mar 6, 2023

This PR adds tools to cross-reference example runs from different branches.

As an example, the following command will clone the add-benchmark-tools branch and run a set of examples on it, storing the runs under the trlx-references project name under a private username:

python -m trlx.reference add-benchmark-tools

It will also run the same set of examples on main by default, unless they are already present under the trlx-references project name, and make a w&b report comparing the two sets. Example of a report: https://wandb.ai/sorry/trlx-references/reports/add-benchmark-tools-v-main--VmlldzozOTAwMTQy

Presence of example runs is determined by a tags config field which contains a hash of all files (excluding markdown). If runs are made on top of some PR, the total hash of the files will not change after the PR is merged into main (unlike the git commit hash), so the runs won't have to be remade.

By default runs are referenced against CarperAI/trlx's main, but optionally they can be referenced against any branch:

python -m trlx.reference xu-song/trlx:fix1 --against CarperAI/trlx:convert-examples-configs

@maxreciprocate maxreciprocate requested a review from cat-state March 6, 2023 16:25
@maxreciprocate maxreciprocate marked this pull request as ready for review March 13, 2023 19:31
@maxreciprocate maxreciprocate requested a review from Dahoas March 13, 2023 19:54
@cat-state (Collaborator) left a comment:

Thanks for this @reciprocated! This mostly looks good to me except for a few suggestions; however, when I tried to run it I get

  File "/mnt/nvme/home/uwu/conda/nemo-113/lib/python3.8/site-packages/wandb/sdk/wandb_require_helpers.py", line 37, in __init__
    self._check_if_requirements_met()
  File "/mnt/nvme/home/uwu/conda/nemo-113/lib/python3.8/site-packages/wandb/sdk/wandb_require_helpers.py", line 48, in _check_if_requirements_met
    raise Exception(
Exception: You must explicitly enable this feature with `wandb.require("report-editing:v0")`

@@ -0,0 +1,45 @@
From 7e7144868f71f437e10a15868b99d2cfcc571f3e Mon Sep 17 00:00:00 2001
Collaborator:

maybe we should hide this file away in a subfolder

Collaborator Author:

I plan to delete this file altogether right before this PR is merged, since it just patches other branches with a subset of the changes this PR introduces

Collaborator:

got it, that makes sense


rm -rf ../benchmark_logs && mkdir ../benchmark_logs

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 --config_file configs/accelerate/zero2-bf16.yaml --main_process_port 8880 examples/ppo_sentiments.py "$args" > ../benchmark_logs/ppo_sentiments.log 2>&1 &
Collaborator:

does it happen to change the timings at all if you run all of them together on the same machine vs one by one?


pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
pip install -e .

args='{"train": {"project_name": "trlx-references", "entity_name": '$entity', "tags": ["'$hash'"]}}'
Collaborator:

Maybe worth adding the GPU name to the tags (interconnect would be cool too if there's some easy way to get that)

Collaborator Author:

It's already logged under "System Hardware" in w&b as GPU type: NVIDIA A100-SXM4-80GB, or do you want to see it specifically as a tag?
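If the GPU name were wanted as a tag, one way to append it is sketched below. The helper name and the injectable `device_name_fn` parameter are mine (the indirection just makes the helper usable off-GPU); `torch.cuda.get_device_name` is the real torch call that returns strings like "NVIDIA A100-SXM4-80GB".

```python
def with_gpu_tag(tags, device_name_fn=None):
    """Return `tags` plus the GPU name (spaces dashed, since wandb tags
    are easier to filter without spaces). Hypothetical helper, not the PR's."""
    if device_name_fn is None:
        try:
            import torch
            if torch.cuda.is_available():
                device_name_fn = lambda: torch.cuda.get_device_name(0)
        except ImportError:
            pass
    if device_name_fn is None:
        # no GPU (or no torch): leave the tags untouched
        return list(tags)
    return list(tags) + [device_name_fn().replace(" ", "-")]
```

For example, `with_gpu_tag([hash_tag])` would yield `[hash_tag, "NVIDIA-A100-SXM4-80GB"]` on the machine discussed above.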


python -m venv venv
. venv/bin/activate
pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
@cat-state (Collaborator), Mar 16, 2023:

is there some way to do this without mutating the environment, e.g. using conda create to make a fresh env? since this can break stuff you had installed that depends on one version of torch.
Also it would probably be good to pin a torch version and tag it? Or otherwise save the pip freeze output somewhere in the run's metadata.
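One way to capture the installed package set in the run's metadata without shelling out to pip is sketched below. The function name is mine, and how to attach the result to the run (e.g. via the run config) is left to the caller; `importlib.metadata` is stdlib in the Python 3.8 used here.

```python
from importlib import metadata
from typing import List

def frozen_requirements() -> List[str]:
    """Return sorted 'name==version' lines for every installed
    distribution, equivalent in spirit to `pip freeze` output.
    Hypothetical helper, not part of the PR."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    )
```

The resulting list could then be stored as a tag-adjacent config field alongside the file hash, so each run records exactly the environment it ran in.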

Collaborator:

It also failed to install for me with

  Using cached https://files.pythonhosted.org/packages/bc/c3/f068337a370801f372f2f8f6bad74a5c140f6fda3d9de154052708dd3c65/Jinja2-3.1.2-py3-none-any.whl
Collecting triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64" (from torch)
  ERROR: Could not find a version that satisfies the requirement triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64" (from torch) (from versions: 0.4.1, 0.4.2)                                                 
ERROR: No matching distribution found for triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64" (from torch)                         

Collaborator Author:

This setup directly reflects the installation instructions from https://github.com/CarperAI/trlx#readme. However, I wouldn't mind pinning every dependency down, both here and there.

Also this is odd, because I can install triton==2.0.0 on uname -i == x86_64; moreover, we probably use the same hardware here

Collaborator:

seems to work if i add a python -m pip install pip --upgrade right before installing torch in benchmark.sh, the pip in the created venv was too old.

@@ -0,0 +1,63 @@
#!/bin/bash

Collaborator:

maybe add set flags (e.g. set -e) to make the script die on error

@maxreciprocate (Collaborator Author):

@cat-state you must be using some older wandb version, could you please try it again after upgrading to wandb==0.14.0?

@cat-state (Collaborator) left a comment:

Thanks, I was able to run it after updating wandb. I think this would be good to merge after adding the pip upgrade and the $dir check. I also still get

wandb.errors.CommError: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup.(Error 403: Forbidden)

even though I have a trlx-references project in my wandb.

set -ex
dir=`mktemp -d -p .`
cd $dir
trap "rm -rf ../$dir" EXIT
Collaborator:

is there any case where $dir could be null/empty string? might be good to add as otherwise you could accidentally rm -rf the containing directory
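A guarded variant of the cleanup could look like the sketch below. This is not the PR's script: the PR cd's into $dir and removes ../$dir on exit, while this sketch omits the cd for brevity and adds the existence check being suggested, so an empty $dir can never expand rm -rf into the containing directory.

```shell
#!/bin/bash
set -euo pipefail

# scratch directory for the benchmark checkout, created inside the cwd
dir=$(mktemp -d -p .)

cleanup() {
  # refuse to delete anything unless $dir is a non-empty name
  # of a directory that actually exists (the suggested guard)
  if [ -n "${dir:-}" ] && [ -d "$dir" ]; then
    rm -rf "$dir"
  fi
}
trap cleanup EXIT
```

With set -u, an unset $dir would abort the script outright rather than expand to an empty string, which is a second line of defence on top of the explicit check.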

Collaborator Author:

i don't think there is, but an additional check wouldn't hurt either



branch=main
entity=null
only_hash=false
only_tiny=false
Collaborator:

there's no way to set this via references.py
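Plumbing those shell variables through would mean exposing them as flags on the Python entry point. A hypothetical sketch, assuming argparse; the flag names and defaults mirror the shell block above but are not the PR's actual CLI:

```python
import argparse

# Hypothetical: how references.py could expose benchmark.sh's
# branch/entity/only_hash/only_tiny variables as CLI flags.
parser = argparse.ArgumentParser("trlx.reference")
parser.add_argument("branch", help="branch to benchmark, e.g. origin/repo:branch")
parser.add_argument("--against", default="CarperAI/trlx:main")
parser.add_argument("--entity", default=None, help="wandb entity for the runs")
parser.add_argument("--only-tiny", action="store_true")
parser.add_argument("--only-hash", action="store_true")

def to_shell_args(ns):
    """Translate parsed flags into the benchmark.sh variable assignments."""
    return {
        "branch": ns.branch,
        "entity": ns.entity or "null",
        "only_hash": str(ns.only_hash).lower(),
        "only_tiny": str(ns.only_tiny).lower(),
    }
```

For example, `python -m trlx.reference some-branch --only-tiny` would set only_tiny=true while leaving entity at its null default.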

@cat-state (Collaborator) left a comment:

Also would be good to add a link to the trlx-references wandb in the readme

@cat-state (Collaborator):

Got it to work after changing my default entity from carperai to myself, thank you!
