Closed as not planned
I am working on a use case where I load a model across multiple GPUs with tensor parallelism, unload it, and then load a new model in the same process. This is the unload logic I am using:
import gc
import logging

import torch
# destroy_model_parallel lives under vllm.distributed.parallel_state in recent releases
from vllm.distributed.parallel_state import destroy_model_parallel

@classmethod
async def unload_models(cls, exiting=False) -> None:
    try:
        if cls._loaded_models:
            logging.info("log: unloading all cached models.")
            torch.multiprocessing.set_start_method("spawn", force=True)
            # Tear down vLLM's tensor-parallel groups before dropping the engines.
            destroy_model_parallel()
            for model_id in list(cls._loaded_models.keys()):
                del cls._loaded_models[model_id].llm_engine
                del cls._loaded_models[model_id]
            gc.collect()
            torch.cuda.empty_cache()
            torch.distributed.destroy_process_group()
    except Exception:
        logging.exception("log: failed to unload models")
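For reference, the load side is essentially a thin wrapper around vLLM's LLM class. The sketch below uses simplified placeholder names (load_model, model_id, tp) rather than my real code; the engine arguments mirror the config shown in the logs further down:

from vllm import LLM

@classmethod
def load_model(cls, model_id: str, model_name: str, tp: int) -> LLM:
    # Simplified placeholder for my real loader: build a (possibly
    # tensor-parallel) engine and cache it so unload_models() can
    # tear it down later.
    llm = LLM(
        model=model_name,
        tensor_parallel_size=tp,  # tp=1 works; tp >= 2 hits the error below
        enforce_eager=False,
    )
    cls._loaded_models[model_id] = llm
    return llm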
This works when I use only 1 GPU, but when tensor_parallel_size is 2 or greater, it gives me the following error once the new model is being loaded:
2024-11-12 22:16:18 - [INFO] - log: unloading all cached models.
INFO 11-12 22:16:19 multiproc_worker_utils.py:133] Terminating local vLLM worker processes
(VllmWorkerProcess pid=3574636) INFO 11-12 22:16:19 multiproc_worker_utils.py:240] Worker exiting
[rank1]:[W1112 22:16:20.271056884 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
2024-11-12 22:16:24 - [INFO] - log: loading model: 0192b1c6-dedc-7edf-9ff5-4da14b931b21 on GPUs: [0, 1]
INFO 11-12 22:16:29 config.py:905] Defaulting to use mp for distributed inference
INFO 11-12 22:16:29 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
(VllmWorkerProcess pid=3575975) INFO 11-12 22:16:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
2024-11-12 22:16:34 - [ERROR] - log: error loading model 0192b1c6-dedc-7edf-9ff5-4da14b931b21: Invariant encountered: value was None when it should not be
2024-11-12 22:16:34 - [INFO] - log: sending IDLE heartbeat...
2024-11-12 22:16:34 - [ERROR] - log: [job: 18] failed to process job: error occured while loading model - Invariant encountered: value was None when it should not be
INFO 11-12 22:16:34 multiproc_worker_utils.py:133] Terminating local vLLM worker processes
Specifically: Invariant encountered: value was None when it should not be
I have tried everything I can find online. Do you have any suggestions? Your insight would be greatly appreciated.
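In case it helps narrow things down, this is the kind of fuller teardown I am considering for the next attempt. It is only a sketch: full_teardown is a hypothetical helper of mine, and I am assuming destroy_distributed_environment can be imported from vllm.distributed.parallel_state in 0.6.3 alongside destroy_model_parallel.

import gc

import torch
# Assumption: both helpers are exported from this module in vLLM 0.6.3.
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)

def full_teardown() -> None:
    # Drop vLLM's tensor/pipeline-parallel groups first.
    destroy_model_parallel()
    # Then tear down the distributed environment vLLM initialized.
    destroy_distributed_environment()
    # Only destroy the default torch process group if one is still alive.
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
    gc.collect()
    torch.cuda.empty_cache()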