import_ckpt isn't thread safe. #14479

@leoleoasd

Describe the bug

I'm using torchrun to launch distributed training, and the script calls model.import_ckpt. Since every rank runs the same script, every rank executes this call, and each one saves the tokenizer to the same /tmp/nemo_tokenizer path, so the concurrent writes race with each other: import_ckpt is not thread/process safe.

Steps/Code to reproduce bug

import torch
import lightning as pl
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
from transformers import AutoTokenizer

if __name__ == "__main__":
    seq_length = 4096
    global_batch_size = 16

    # tokenizer = get_nmt_tokenizer(
    #     "megatron", "GPT2BPETokenizer"
    # )
    tokenizer = get_nmt_tokenizer(
        library="huggingface",
        model_name='meta-llama/Meta-Llama-3-8B-Instruct',
        use_fast=True,
    )
    print(1)
    model = llm.LlamaModel(llm.Llama3Config8B(), tokenizer=tokenizer)
    print(2)
    print(tokenizer, model)
    model.config.seq_length = seq_length
    ckpt_path = model.import_ckpt(path='hf://meta-llama/Meta-Llama-3-8B-Instruct')
    print(ckpt_path)
    data = llm.SquadDataModule(seq_length=seq_length, global_batch_size=global_batch_size, tokenizer=tokenizer)
    print(3)
    ## initialize the strategy
    strategy = nl.MegatronStrategy(
        context_parallel_size=1,
        tensor_model_parallel_size=8,
        pipeline_model_parallel_size=1,
        pipeline_dtype=torch.bfloat16,
    )
    print(4)

    ## setup the optimizer
    opt_config = OptimizerConfig(
        optimizer='adam',
        lr=6e-4,
        bf16=True,
    )
    opt = nl.MegatronOptimizerModule(config=opt_config)
    print(5)
    trainer = nl.Trainer(
        devices=8, ## you can change the number of devices to suit your setup
        max_steps=50,
        accelerator="gpu",
        strategy=strategy,
        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
    )
    wandb_logger = pl.pytorch.loggers.WandbLogger()
    nemo_logger = nl.NeMoLogger(
        log_dir="test_logdir", ## logs and checkpoints will be written here
        wandb=wandb_logger,
    )

    # llm.train(
    #     model=model,
    #     data=data,
    #     trainer=trainer,
    #     log=nemo_logger,
    #     tokenizer='data',
    #     optim=opt,
    # )
    print(6)
    trainer.fit(model, data, ckpt_path=ckpt_path)
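
A possible workaround, sketched but untested: serialize import_ckpt across the processes on a node with a file lock, so only one rank at a time writes the cached tokenizer/checkpoint under /tmp/nemo_tokenizer. The helper name and lock path below are my own choice, not NeMo API; it only assumes the filelock package is installed.

from filelock import FileLock

def import_ckpt_serialized(model, path, lock_file="/tmp/nemo_import_ckpt.lock"):
    # Only one process on the node holds the lock at a time; the other ranks
    # block here and then import from the cache the first process already wrote.
    with FileLock(lock_file):
        return model.import_ckpt(path=path)

# usage in the script above:
# ckpt_path = import_ckpt_serialized(model, 'hf://meta-llama/Meta-Llama-3-8B-Instruct')

A torch.distributed barrier could guard this too, but at this point in the script the process group is usually not initialized yet (the Trainer sets it up later), so a file lock is the simpler option.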

Expected behavior

The model loads successfully, even when import_ckpt is called concurrently from all ranks launched by torchrun.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud (specify cloud provider - AWS, Azure, GCP, Colab)]
  • Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model
