Describe the bug
I'm using torchrun to launch a distributed training job, and my script calls model.import_ckpt. The import saves the tokenizer to /tmp/nemo_tokenizer; because every rank writes to that same path, the save is not safe under concurrent execution and the ranks race with each other.
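For reference, the reproduction script below is launched on a single 8-GPU node with something along the lines of torchrun --nproc_per_node=8 repro.py (the script name and exact flags here are only placeholders, matching devices=8 and tensor_model_parallel_size=8 in the script).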
Steps/Code to reproduce bug
import torch
import lightning as pl
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
from transformers import AutoTokenizer

if __name__ == "__main__":
    seq_length = 4096
    global_batch_size = 16

    # tokenizer = get_nmt_tokenizer(
    #     "megatron", "GPT2BPETokenizer"
    # )
    tokenizer = get_nmt_tokenizer(
        library="huggingface",
        model_name='meta-llama/Meta-Llama-3-8B-Instruct',
        use_fast=True,
    )
    print(1)

    model = llm.LlamaModel(llm.Llama3Config8B(), tokenizer=tokenizer)
    print(2)
    print(tokenizer, model)
    model.config.seq_length = seq_length

    # every rank runs import_ckpt, which writes the tokenizer to /tmp/nemo_tokenizer
    ckpt_path = model.import_ckpt(path='hf://meta-llama/Meta-Llama-3-8B-Instruct')
    print(ckpt_path)

    data = llm.SquadDataModule(seq_length=seq_length, global_batch_size=global_batch_size, tokenizer=tokenizer)
    print(3)

    ## initialize the strategy
    strategy = nl.MegatronStrategy(
        context_parallel_size=1,
        tensor_model_parallel_size=8,
        pipeline_model_parallel_size=1,
        pipeline_dtype=torch.bfloat16,
    )
    print(4)

    ## setup the optimizer
    opt_config = OptimizerConfig(
        optimizer='adam',
        lr=6e-4,
        bf16=True,
    )
    opt = nl.MegatronOptimizerModule(config=opt_config)
    print(5)

    trainer = nl.Trainer(
        devices=8,  ## you can change the number of devices to suit your setup
        max_steps=50,
        accelerator="gpu",
        strategy=strategy,
        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
    )

    wandb_logger = pl.pytorch.loggers.WandbLogger()
    nemo_logger = nl.NeMoLogger(
        log_dir="test_logdir",  ## logs and checkpoints will be written here
        wandb=wandb_logger,
    )

    # llm.train(
    #     model=model,
    #     data=data,
    #     trainer=trainer,
    #     log=nemo_logger,
    #     tokenizer='data',
    #     optim=opt,
    # )
    print(6)
    trainer.fit(model, data, ckpt_path=ckpt_path)
Expected behavior
The checkpoint import completes on every rank and the model loads, without the tokenizer save to /tmp/nemo_tokenizer racing between processes.
Environment overview (please complete the following information)
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)]
- Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide docker pull & docker run commands used
Environment details
If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Additional context
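As a possible workaround (a sketch only, not verified; I'm also not sure an early process-group init plays nicely with MegatronStrategy's own setup), I'm considering guarding the import so that only global rank 0 runs import_ckpt and writes /tmp/nemo_tokenizer, then broadcasting the resulting path to the other ranks. This would replace the ckpt_path = model.import_ckpt(...) line in the script above:

import os
import torch.distributed as dist

# Sketch: let a single rank perform the HF -> NeMo import (and the tokenizer write).
# Assumes the script is launched by torchrun, which sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT.
if not dist.is_initialized():
    dist.init_process_group(backend="gloo")  # may conflict with the trainer's later NCCL init

payload = [None]
if dist.get_rank() == 0:
    # only rank 0 imports the checkpoint and writes /tmp/nemo_tokenizer
    payload[0] = model.import_ckpt(path='hf://meta-llama/Meta-Llama-3-8B-Instruct')
dist.broadcast_object_list(payload, src=0)  # other ranks block here and receive the same path
ckpt_path = payload[0]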