[BugFix] Explicitly set gpu_memory_utilization #1560
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Signed-off-by: Rahul Tuli <[email protected]>
Force-pushed from ee0712f to 7cd07e6
Yeah, I think this makes sense now that vLLM 0.9.1 seems to be stricter in checking available memory. We may need to add this to other configs moving forward; I've been hitting this more frequently in local runs.
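For reference, a minimal sketch of the kind of override being discussed, i.e. passing a lower `gpu_memory_utilization` than vLLM's 0.9 default when constructing the engine. The model name is taken from the failing test's log; the surrounding test harness code is not shown and may differ in the repo.

```python
from vllm import LLM, SamplingParams

# Sketch: load the compressed checkpoint with a lower memory target so that
# vLLM 0.9.1's startup check (free memory >= gpu_memory_utilization * total)
# passes even when other processes already hold a few GiB on the device.
llm = LLM(
    model="Phi-3-mini-4k-instruct-kv_cache_default_phi3",  # name from the test log
    gpu_memory_utilization=0.8,  # default is 0.9; 0.8 leaves headroom on an 80GB A100
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
```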
Have we figured out why it only impacts this particular test case?
- Remove hardcoded gpu_memory_utilization default value
- Let vLLM handle its own defaults when the parameter is not specified
- Use a consistent kwargs pattern for all LLM initialization parameters
- Only pass parameters that are explicitly set in config

This change addresses the PR comment about not overriding vLLM's defaults unnecessarily and makes the code more maintainable.

Signed-off-by: Rahul <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
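A rough sketch of the kwargs pattern this commit describes, assuming the e2e test config is available as a plain dict. The `build_llm` helper and the optional key names other than `gpu_memory_utilization` are illustrative, not the actual code in `tests/e2e/vLLM/test_vllm.py`:

```python
from vllm import LLM

def build_llm(config: dict) -> LLM:
    """Construct the vLLM engine, forwarding only explicitly configured values.

    Any key that is absent from the config (or set to None) is simply not
    passed, so vLLM keeps its own defaults instead of a hardcoded override.
    """
    llm_kwargs = {"model": config["model"]}

    # Optional knobs: forward them only when the config actually sets them.
    for key in ("gpu_memory_utilization", "max_model_len", "tensor_parallel_size"):
        if config.get(key) is not None:
            llm_kwargs[key] = config[key]

    return LLM(**llm_kwargs)
```

With this pattern, a config that omits `gpu_memory_utilization` gets vLLM's default, while `kv_cache_phi3.yaml` can set it to 0.8 explicitly.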
Signed-off-by: Rahul Tuli <[email protected]>
We started seeing the following failure in the vLLM Phi-3 kv_cache quantization e2e test starting from vLLM 0.9.1, using one 80GB A100:

```bash
2025-06-17T00:26:38.269335-0400 | test_vllm | INFO - ================= RUNNING vLLM =========================
INFO 06-17 00:26:46 [config.py:823] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 06-17 00:26:46 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 06-17 00:26:46 [utils.py:2597] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
WARNING 06-17 00:26:47 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: NVIDIA/nccl#1234
INFO 06-17 00:26:49 [__init__.py:244] Automatically detected platform cuda.
INFO 06-17 00:26:53 [core.py:455] Waiting for init message from front-end.
INFO 06-17 00:26:53 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Phi-3-mini-4k-instruct-kv_cache_default_phi3', speculative_config=None, tokenizer='Phi-3-mini-4k-instruct-kv_cache_default_phi3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Phi-3-mini-4k-instruct-kv_cache_default_phi3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-17 00:26:53 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f4cc29e7f70>
ERROR 06-17 00:26:53 [core.py:515] EngineCore failed to start.
ERROR 06-17 00:26:53 [core.py:515] Traceback (most recent call last):
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-17 00:26:53 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-17 00:26:53 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 76, in __init__
ERROR 06-17 00:26:53 [core.py:515]     self.model_executor = executor_class(vllm_config)
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 06-17 00:26:53 [core.py:515]     self._init_executor()
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 06-17 00:26:53 [core.py:515]     self.collective_rpc("init_device")
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-17 00:26:53 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 06-17 00:26:53 [core.py:515]     return func(*args, **kwargs)
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 606, in init_device
ERROR 06-17 00:26:53 [core.py:515]     self.worker.init_device()  # type: ignore
ERROR 06-17 00:26:53 [core.py:515]   File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 140, in init_device
ERROR 06-17 00:26:53 [core.py:515]     raise ValueError(
ERROR 06-17 00:26:53 [core.py:515] ValueError: Free memory on device (70.82/79.25 GiB) on startup is less than desired GPU memory utilization (0.9, 71.33 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
---------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.80it/s]
100%|██████████| 391/391 [00:00<00:00, 645912.90it/s]
Calibrating weights: 100%|██████████| 391/391 [00:00<00:00, 334961.78it/s]
Calibrating: 100%|██████████| 256/256 [00:28<00:00, 9.00it/s]
Compressing model: 391it [00:00, 957928.07it/s]
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
    raise e
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 390, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 76, in __init__
    self.model_executor = executor_class(vllm_config)
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
    self.collective_rpc("init_device")
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/utils.py", line 2671, in run_method
    return func(*args, **kwargs)
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 606, in init_device
    self.worker.init_device()  # type: ignore
  File "/home/rahul/llm-compressor/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 140, in init_device
    raise ValueError(
ValueError: Free memory on device (70.82/79.25 GiB) on startup is less than desired GPU memory utilization (0.9, 71.33 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
============================================================================== warnings summary ==============================================================================
tests/e2e/vLLM/test_vllm.py::TestvLLM::test_vllm[tests/e2e/vLLM/configs/kv_cache_phi3.yaml]
  /home/rahul/llm-compressor/src/llmcompressor/pytorch/__init__.py:19: UserWarning: torch.compile is not supported by llmcompressor for torch 2.0.x
    warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================== short test summary info ===========================================================================
FAILED tests/e2e/vLLM/test_vllm.py::TestvLLM::test_vllm[tests/e2e/vLLM/configs/kv_cache_phi3.yaml] - RuntimeError: Engine core initialization failed. See root cause above.
```

Explicitly setting gpu_memory_utilization to 0.8 fixes our test. This is a temporary solution until we determine why vLLM 0.9.1 requires more memory at startup than 0.9.0 did.

```
CUDA_VISIBLE_DEVICES=0 pytest -v /home/rahul/llm-compressor/tests/e2e/vLLM/test_vllm.py
============================================================================ test session starts =============================================================================
platform linux -- Python 3.10.12, pytest-8.4.0, pluggy-1.6.0 -- /home/rahul/llm-compressor/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/rahul/llm-compressor
configfile: pyproject.toml
plugins: anyio-4.9.0, rerunfailures-15.1, mock-3.14.1
collected 1 item

tests/e2e/vLLM/test_vllm.py::TestvLLM::test_vllm[tests/e2e/vLLM/configs/kv_cache_phi3.yaml] PASSED [100%]

===================================================================== warnings summary ======================================================================
tests/e2e/vLLM/test_vllm.py::TestvLLM::test_vllm[tests/e2e/vLLM/configs/kv_cache_phi3.yaml]
  /home/rahul/llm-compressor/src/llmcompressor/pytorch/__init__.py:19: UserWarning: torch.compile is not supported by llmcompressor for torch 2.0.x
    warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================= 1 passed, 1 warning in 130.84s (0:02:10) ==========================================================
```

---------

Signed-off-by: Rahul Tuli <[email protected]>
Signed-off-by: Rahul <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
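To make the numbers in the traceback concrete, here is a small back-of-the-envelope check of the startup condition. This only reproduces the arithmetic from the error message; it is not vLLM's actual implementation in `gpu_worker.py`.

```python
# Values reported by the failing run: 70.82 GiB free out of 79.25 GiB total on the A100.
free_gib, total_gib = 70.82, 79.25

for util in (0.9, 0.8):
    required = util * total_gib
    status = "OK" if required <= free_gib else "FAILS"
    print(f"gpu_memory_utilization={util}: needs {required:.2f} GiB, {free_gib} GiB free -> {status}")

# gpu_memory_utilization=0.9: needs 71.33 GiB, 70.82 GiB free -> FAILS
# gpu_memory_utilization=0.8: needs 63.40 GiB, 70.82 GiB free -> OK
```

With only ~8.4 GiB held by other processes at startup, the 0.9 default already exceeds the free memory, which is why lowering the target to 0.8 lets the engine initialize.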