Commit e472e07

mayank31398, yikangshen, and ArthurZucker authored
Granitemoe (#33207)
* first commit
* drop tokenizer
* drop tokenizer
* drop tokenizer
* drop convert
* granite
* drop tokenization test
* mup
* fix
* reformat
* reformat
* reformat
* fix docs
* stop checking for checkpoint
* update support
* attention multiplier
* update model
* tiny drop
* saibo drop
* skip test
* fix test
* fix test
* drop
* drop useless imports
* update docs
* drop flash function
* copied from
* drop pretraining tp
* drop pretraining tp
* drop pretraining tp
* drop unused import
* drop code path
* change name
* softmax scale
* head dim
* drop legacy cache
* rename params
* cleanup
* fix copies
* comments
* add back legacy cache
* multipliers
* multipliers
* multipliers
* text fix
* fix copies
* merge
* multipliers
* attention multiplier
* drop unused imports
* add granitemoe
* add decoration
* remove moe from sequenceclassification
* fix test
* fix
* fix
* fix
* move rope?
* merge
* drop bias
* drop bias
* Update src/transformers/models/granite/configuration_granite.py Co-authored-by: Arthur <[email protected]>
* fix
* Update src/transformers/models/granite/modeling_granite.py Co-authored-by: Arthur <[email protected]>
* fix
* fix
* fix
* fix
* drop
* drop
* fix
* fix
* cleanup
* cleanup
* fix
* fix granite tests
* fp32 test
* fix
* drop jitter
* fix
* rename
* rename
* fix config
* add gen test

---------

Co-authored-by: Yikang Shen <[email protected]>
Co-authored-by: Arthur <[email protected]>
1 parent 49a0bef commit e472e07

16 files changed: +2393 -58 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -424,6 +424,8 @@
   title: GPTSw3
 - local: model_doc/granite
   title: Granite
+- local: model_doc/granitemoe
+  title: GraniteMoe
 - local: model_doc/herbert
   title: HerBERT
 - local: model_doc/ibert

docs/source/en/index.md

Lines changed: 1 addition & 0 deletions
@@ -159,6 +159,7 @@ Flax), PyTorch, and/or TensorFlow.
 | [GPTBigCode](model_doc/gpt_bigcode) ||||
 | [GPTSAN-japanese](model_doc/gptsan-japanese) ||||
 | [Granite](model_doc/granite) ||||
+| [GraniteMoe](model_doc/granitemoe) ||||
 | [Graphormer](model_doc/graphormer) ||||
 | [Grounding DINO](model_doc/grounding-dino) ||||
 | [GroupViT](model_doc/groupvit) ||||
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# GraniteMoe
+
+## Overview
+
+The GraniteMoe model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
+
+PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to dense models with 2x the active parameters across various benchmarks, including natural language multiple-choice tasks, code generation, and math reasoning.
+
+The abstract from the paper is the following:
+
+*Finding the optimal learning rate for language model pretraining is a challenging task.
+This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters but also because it is prohibitively expensive to perform a hyperparameter search for large language models with Billions or Trillions of parameters. Recent studies propose using small proxy models and small corpus to perform hyperparameter searches and transposing the optimal parameters to large models and large corpus. While the zero-shot transferability is theoretically and empirically proven for model size related hyperparameters, like depth and width, the zero-shot transfer from small corpus to large corpus is underexplored.
+In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on the observation, we propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size. The experiment shows that combining the Power scheduler with Maximum Update Parameterization (μP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models.
+We [open source](https://huggingface.co/collections/ibm/power-lm-66be64ae647ddf11b9808000) these pretrained models.*
+
+Tips:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_path = "ibm/PowerMoE-3b"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+# drop device_map if running on CPU
+model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
+model.eval()
+
+# change input text as desired
+prompt = "Write a code to find the maximum value in a list of numbers."
+
+# tokenize the text and move the tensors to the model's device
+input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
+# generate output tokens
+output = model.generate(**input_tokens, max_new_tokens=100)
+# decode output tokens into text
+output = tokenizer.batch_decode(output)
+# loop over the batch to print, in this example the batch size is 1
+for i in output:
+    print(i)
+```
+
+This model was contributed by [mayank-mishra](https://huggingface.co/mayank-mishra).
+
+
+## GraniteMoeConfig
+
+[[autodoc]] GraniteMoeConfig
+
+## GraniteMoeModel
+
+[[autodoc]] GraniteMoeModel
+    - forward
+
+## GraniteMoeForCausalLM
+
+[[autodoc]] GraniteMoeForCausalLM
+    - forward
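
The overview above says the model "sparsely activates 800M parameters for each token". As a rough illustration of what sparse activation means, here is a minimal pure-Python sketch of top-k expert routing for a single token. It is not the GraniteMoe implementation: the function names `top_k_route` and `moe_forward`, the toy scalar "experts", and the choice to apply softmax over only the selected logits are all assumptions made for this example.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token and normalize their weights."""
    # Indices of the k largest logits, highest first (ties keep index order).
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the k selected experts' outputs; unselected experts are never called."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))


# Four toy scalar "experts" (hypothetical); only two of them run for this token.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
```

Because only `k` experts are evaluated per token, the compute per token scales with the selected experts rather than with the total parameter count, which is the sense in which a 3B model can activate far fewer parameters per token.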

docs/source/en/perf_infer_gpu_one.md

Lines changed: 2 additions & 0 deletions
@@ -52,6 +52,7 @@ FlashAttention-2 is currently supported for the following architectures:
 * [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel)
 * [GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj#transformers.GPTJModel)
 * [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel)
+* [GraniteMoe](https://huggingface.co/docs/transformers/model_doc/granitemoe#transformers.GraniteMoeModel)
 * [Idefics2](https://huggingface.co/docs/transformers/model_doc/idefics2#transformers.Idefics2Model)
 * [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
 * [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
@@ -226,6 +227,7 @@ For now, Transformers supports SDPA inference and training for the following arc
 * [Hubert](https://huggingface.co/docs/transformers/model_doc/hubert#transformers.HubertModel)
 * [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel)
 * [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel)
+* [GraniteMoe](https://huggingface.co/docs/transformers/model_doc/granitemoe#transformers.GraniteMoeModel)
 * [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
 * [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
 * [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)

src/transformers/__init__.py

Lines changed: 14 additions & 0 deletions
@@ -465,6 +465,7 @@
     "models.gpt_sw3": [],
     "models.gptj": ["GPTJConfig"],
     "models.granite": ["GraniteConfig"],
+    "models.granitemoe": ["GraniteMoeConfig"],
     "models.grounding_dino": [
         "GroundingDinoConfig",
         "GroundingDinoProcessor",
@@ -2343,6 +2344,13 @@
             "GranitePreTrainedModel",
         ]
     )
+    _import_structure["models.granitemoe"].extend(
+        [
+            "GraniteMoeForCausalLM",
+            "GraniteMoeModel",
+            "GraniteMoePreTrainedModel",
+        ]
+    )
     _import_structure["models.grounding_dino"].extend(
         [
             "GroundingDinoForObjectDetection",
@@ -5237,6 +5245,7 @@
     )
     from .models.gptj import GPTJConfig
     from .models.granite import GraniteConfig
+    from .models.granitemoe import GraniteMoeConfig
     from .models.grounding_dino import (
         GroundingDinoConfig,
         GroundingDinoProcessor,
@@ -6976,6 +6985,11 @@
             GraniteModel,
             GranitePreTrainedModel,
         )
+        from .models.granitemoe import (
+            GraniteMoeForCausalLM,
+            GraniteMoeModel,
+            GraniteMoePreTrainedModel,
+        )
         from .models.grounding_dino import (
             GroundingDinoForObjectDetection,
             GroundingDinoModel,

src/transformers/models/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -106,6 +106,7 @@
     gpt_sw3,
     gptj,
     granite,
+    granitemoe,
     grounding_dino,
     groupvit,
     herbert,

src/transformers/models/auto/configuration_auto.py

Lines changed: 2 additions & 0 deletions
@@ -123,6 +123,7 @@
         ("gptj", "GPTJConfig"),
         ("gptsan-japanese", "GPTSanJapaneseConfig"),
         ("granite", "GraniteConfig"),
+        ("granitemoe", "GraniteMoeConfig"),
         ("graphormer", "GraphormerConfig"),
         ("grounding-dino", "GroundingDinoConfig"),
         ("groupvit", "GroupViTConfig"),
@@ -417,6 +418,7 @@
         ("gptj", "GPT-J"),
         ("gptsan-japanese", "GPTSAN-japanese"),
         ("granite", "Granite"),
+        ("granitemoe", "GraniteMoe"),
         ("graphormer", "Graphormer"),
         ("grounding-dino", "Grounding DINO"),
         ("groupvit", "GroupViT"),

src/transformers/models/auto/modeling_auto.py

Lines changed: 2 additions & 0 deletions
@@ -120,6 +120,7 @@
         ("gptj", "GPTJModel"),
         ("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"),
         ("granite", "GraniteModel"),
+        ("granitemoe", "GraniteMoeModel"),
         ("graphormer", "GraphormerModel"),
         ("grounding-dino", "GroundingDinoModel"),
         ("groupvit", "GroupViTModel"),
@@ -485,6 +486,7 @@
         ("gpt_neox_japanese", "GPTNeoXJapaneseForCausalLM"),
         ("gptj", "GPTJForCausalLM"),
         ("granite", "GraniteForCausalLM"),
+        ("granitemoe", "GraniteMoeForCausalLM"),
         ("jamba", "JambaForCausalLM"),
         ("jetmoe", "JetMoeForCausalLM"),
         ("llama", "LlamaForCausalLM"),
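
The `("granitemoe", ...)` tuples added to the auto configuration and auto modeling files register the new model type in name-lookup tables. The sketch below shows the general idea with trimmed-down, hypothetical copies of such tables. The table names, the helper `resolve_class_name`, and the error message are assumptions for illustration, not the library's actual resolution code.

```python
from collections import OrderedDict

# Hypothetical, trimmed-down tables holding only entries visible in the hunks above.
CONFIG_NAMES = OrderedDict(
    [
        ("granite", "GraniteConfig"),
        ("granitemoe", "GraniteMoeConfig"),
    ]
)
CAUSAL_LM_NAMES = OrderedDict(
    [
        ("granite", "GraniteForCausalLM"),
        ("granitemoe", "GraniteMoeForCausalLM"),
    ]
)


def resolve_class_name(model_type, mapping):
    """Return the class name registered for `model_type`, or raise a descriptive error."""
    try:
        return mapping[model_type]
    except KeyError:
        known = ", ".join(mapping)
        raise ValueError(f"Unrecognized model type {model_type!r}; known types: {known}") from None


config_name = resolve_class_name("granitemoe", CONFIG_NAMES)
model_name = resolve_class_name("granitemoe", CAUSAL_LM_NAMES)
```

Storing class names as strings rather than class objects means the tables can be built without importing every model module up front.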
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+# Copyright 2024 EleutherAI and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+
+
+_import_structure = {
+    "configuration_granitemoe": ["GraniteMoeConfig"],
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_granitemoe"] = [
+        "GraniteMoeForCausalLM",
+        "GraniteMoeModel",
+        "GraniteMoePreTrainedModel",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_granitemoe import GraniteMoeConfig
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_granitemoe import (
+            GraniteMoeForCausalLM,
+            GraniteMoeModel,
+            GraniteMoePreTrainedModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
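
The file above builds an `_import_structure` dict and hands it to `_LazyModule` so that submodules are imported only when one of their names is first accessed. The following is a minimal sketch of that deferred-import pattern using only the standard library. The class `LazyModule` here is a hypothetical stand-in written for this example, not the transformers `_LazyModule`, and it maps names to standard-library modules (`json`, `math`) purely so the demo can run anywhere.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical minimal lazy module: imports a submodule on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for direct lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.__all__ = list(self._name_to_module)

    def __getattr__(self, attr):
        # Python calls __getattr__ only when normal attribute lookup has failed.
        name_to_module = self.__dict__.get("_name_to_module", {})
        if attr not in name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(name_to_module[attr]), attr)
        setattr(self, attr, value)  # cache so later accesses bypass __getattr__
        return value


# Same shape as `_import_structure`, but pointing at stdlib modules for the demo.
lazy = LazyModule("demo", {"json": ["dumps"], "math": ["sqrt"]})
```

Nothing is resolved when `lazy` is created; accessing `lazy.sqrt` triggers the lookup and caches the result on the module object. In the file above, the same deferral is what keeps the torch-dependent modeling module from being imported until one of its names is actually requested.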
