
[Misc]: Invariant encountered: value was None when it should not be #10284


Description

@nithingovindugari

I am working on a use case where I load a model across multiple GPUs, unload it, and then load a new model in the same process. My unload routine looks like this:

import gc
import logging

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

    @classmethod
    async def unload_models(cls, exiting=False) -> None:
        try:
            if cls._loaded_models:
                logging.info("log: unloading all cached models.")
                torch.multiprocessing.set_start_method("spawn", force=True)
                # Tear down vLLM's tensor-parallel state before dropping the engines.
                destroy_model_parallel()
                for model_id in list(cls._loaded_models.keys()):
                    del cls._loaded_models[model_id].llm_engine
                    del cls._loaded_models[model_id]
                gc.collect()
                torch.cuda.empty_cache()
                torch.distributed.destroy_process_group()
        except Exception:
            # Surface unload failures to the caller.
            logging.exception("log: failed to unload cached models")
            raise
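
For context, each model is created with vLLM's multiprocessing backend (the log below shows "Defaulting to use mp for distributed inference"). Here is a minimal sketch of the load path, with the arguments mirroring the engine config in the log; the cache dict and function name are illustrative, not my exact code:

from vllm import LLM

# Illustrative only: the arguments mirror the engine config in the log below;
# loaded_models stands in for cls._loaded_models.
loaded_models: dict[str, LLM] = {}

def load_model(model_id: str) -> LLM:
    llm = LLM(
        model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",
        tensor_parallel_size=2,            # works with 1, fails with 2 or more
        max_model_len=2000,
        quantization="compressed-tensors",
        seed=0,
    )
    loaded_models[model_id] = llm
    return llm
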

unload_models works when I use only one GPU, but when tensor_parallel_size is 2 or greater, I get the following error once the new model is being loaded:


2024-11-12 22:16:18 - [INFO] - log: unloading all cached models.
INFO 11-12 22:16:19 multiproc_worker_utils.py:133] Terminating local vLLM worker processes
(VllmWorkerProcess pid=3574636) INFO 11-12 22:16:19 multiproc_worker_utils.py:240] Worker exiting
[rank1]:[W1112 22:16:20.271056884 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
2024-11-12 22:16:24 - [INFO] - log: loading model: 0192b1c6-dedc-7edf-9ff5-4da14b931b21 on GPUs: [0, 1]
INFO 11-12 22:16:29 config.py:905] Defaulting to use mp for distributed inference
INFO 11-12 22:16:29 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
(VllmWorkerProcess pid=3575975) INFO 11-12 22:16:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
2024-11-12 22:16:34 - [ERROR] - log: error loading model 0192b1c6-dedc-7edf-9ff5-4da14b931b21: Invariant encountered: value was None when it should not be
2024-11-12 22:16:34 - [INFO] - log: sending IDLE heartbeat...
2024-11-12 22:16:34 - [ERROR] - log: [job: 18] failed to process job: error occured while loading model - Invariant encountered: value was None when it should not be
INFO 11-12 22:16:34 multiproc_worker_utils.py:133] Terminating local vLLM worker processes

The specific failure is: Invariant encountered: value was None when it should not be

I have tried everything I can find online. Do you have any suggestions? Your insight is greatly appreciated.
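
For reference, the fuller in-process teardown that is commonly suggested for the multiprocessing backend looks roughly like the sketch below. The destroy_distributed_environment import path and the model_executor attribute are assumptions based on v0.6.3, and I am not claiming this resolves the invariant error:

import gc

import torch
# Assumed import path for v0.6.3; these helpers may live elsewhere in other versions.
from vllm.distributed.parallel_state import (
    destroy_model_parallel,
    destroy_distributed_environment,
)

def full_unload(loaded_models: dict, model_id: str) -> None:
    llm = loaded_models.pop(model_id)
    # Tear down the tensor-parallel groups and the distributed environment
    # that the worker processes joined.
    destroy_model_parallel()
    destroy_distributed_environment()
    # Drop the executor first so the multiprocessing workers shut down before
    # the engine object itself is garbage collected.
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()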

Metadata

Labels: misc, stale (over 90 days of inactivity)