
Conversation

@aicrumb (Contributor) commented Dec 12, 2022

Uses the bitsandbytes Adam optimizer instead of torch's, adds very simple gradient accumulation, supports finetuning only biases/layernorms (tested; works very well and is very fast), and makes it easier to use different precisions. (Sorry for the absolute mass of commits, I was making them on github.com.)
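
A minimal sketch of the optimizer swap and bias/layernorm-only finetuning described above; the model name, learning rate, and parameter-name heuristic are illustrative assumptions, not trlx's actual configuration:

```python
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Finetune only biases and layernorm parameters; freeze everything else.
# The name matching below is a heuristic for GPT-2-style parameter names.
for name, param in model.named_parameters():
    lowered = name.lower()
    param.requires_grad = (
        "bias" in lowered or "ln_" in lowered or "layernorm" in lowered or "layer_norm" in lowered
    )

trainable_params = [p for p in model.parameters() if p.requires_grad]

# bitsandbytes 8-bit Adam in place of torch.optim.Adam.
optimizer = bnb.optim.Adam8bit(trainable_params, lr=1e-4)
```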

@jon-tow (Collaborator) commented Dec 16, 2022

Hey @Crumb! Just getting back to this. I think it's best to separate the 3 new features into their own PRs to avoid unwieldy commit histories (if there's an issue in one feature we'd have to revert the other features as well).

Let's make this a PR focused on adding the bitsandbytes optimizer support. I believe there might be some more thought that needs to go into adding grad accumulation; it's possible accelerate handles this in subtly different ways for the deepspeed integration (which we rely on heavily), and it should be properly tested.
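
For reference, a minimal sketch of accelerate's built-in gradient accumulation with a toy model; whether this composes cleanly with the deepspeed integration is exactly the open question above:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

# Toy setup; the point is the accumulation context manager, not the model.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)

# accelerate accumulates gradients over the given number of steps and only
# syncs/steps at the accumulation boundary.
accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```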

Hope that's alright with you 🙏

(Also, no worries about the many commits! We squash them before merging)

@LouisCastricato (Contributor) commented:

Let's make the bitsandbytes PR.

@jon-tow jon-tow added this to the v0.4.0 milestone Jan 2, 2023
@jon-tow jon-tow changed the title bnb adam, grad accumulation, dtype param Add bitsandbytes optimizer support Jan 2, 2023
@jon-tow (Collaborator) commented Jan 4, 2023

Details on latest changes

  • We opt not to convert nn.Embedding layers into bnb.nn.StableEmbedding layers, since trlx is essentially a fine-tuning library. According to Tim Dettmers:

    "StableEmbedding layer is only required if the model was pretrained with the StableEmbedding layer."

    See relevant discussion here.

    • Instead, we take the more general approach of forcing nn.Embedding layers into 32-bit precision, following advice from Tim Dettmers (see the sketch after this list):

      "for pretrained models, the best would be to use the 32-bit optimized embedding layer (bnb.nn.Embedding) and no layer norm if the pretrained model was not trained with a [StableEmbedding] layer norm."

      Again, relevant discussion here.
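
One way bitsandbytes supports this is through its GlobalOptimManager, which keeps 32-bit optimizer state for selected parameters while everything else uses 8-bit Adam. The snippet below is a minimal sketch of that approach (placeholder model and learning rate), not necessarily the exact mechanism used in this PR:

```python
import bitsandbytes as bnb
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Request 32-bit optimizer state for all embedding weights before creating
# the optimizer; the remaining parameters keep 8-bit Adam state.
manager = bnb.optim.GlobalOptimManager.get_instance()
for module in model.modules():
    if isinstance(module, nn.Embedding):
        manager.register_module_override(module, "weight", {"optim_bits": 32})

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)
```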

Reports

  • In practice, we see ~13% memory savings for GPT-J when trained with 8 unfrozen layers. Note that large memory savings occur only when a large portion of the model is unfrozen (i.e. num_layers_unfrozen is a large fraction of the total number of layers).
    • There seems to be a slight penalty for using 8-bit optimizers, but nothing significantly divergent; this can possibly be overcome with improved hyperparameter tuning.
    • Relevant wandb report: here

CC @LouisCastricato

@LouisCastricato (Contributor) commented Jan 4, 2023

Looks good to me, let's merge.

@jon-tow jon-tow merged commit 17df88e into CarperAI:main Jan 4, 2023
@maxreciprocate (Collaborator) commented:

hey @jon-tow, your wandb report (https://wandb.ai/carperai/trlx/reports/trlx-Add-bitsandbytes-optimizer-support-133--VmlldzozMjU5OTQx) doesn't render for me for some reason (there's no data in the charts, as if the run set is empty). Is it some wandb shenanigans again, or just me 😴

@LouisCastricato (Contributor) commented:

It was rendering yesterday. I think wandb is just being weird.

@jon-tow (Collaborator) commented Jan 4, 2023

@reciprocated those were jon-tow shenanigans 😅 Yesterday I discovered that deleting a run also deletes its data from reports (which makes sense in hindsight). I re-ran them from my personal wandb account; try this link: https://wandb.ai/jon-tow/trlx/reports/trlx-Add-bitsandbytes-optimizer-support-133--VmlldzozMjY1MzI1

@maxreciprocate (Collaborator) commented:

@jon-tow that link gives a 404 for me 😓 you have to share a "magic" link instead

@maxreciprocate (Collaborator) commented:

@jon-tow Thanks 🙏
