[V1][Frontend] Improve Shutdown And Logs #14048

rafvasq · 2025-02-28T15:33:45Z

In place of #11737

authored by: @robertgshaw2-redhat

SUMMARY:

Prior to this PR, if we encountered an error in a background process, we killed the process tree immediately, which meant that we could not clean up resources or return meaningful status codes to clients. This PR overhauls the error handling to instead shut down the background processes and raise errors that let us return proper HTTP status codes to users.
Prior to this PR, we did not shut down properly when errors occurred during startup, especially in the TP case.
Prior to this PR, we used signals to catch errors from background processes. Due to limitations of Python, this prevented us from running outside the main thread, which is a problem for deployments in TritonServer.
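
The main-thread limitation mentioned above comes from Python's `signal` module: `signal.signal` raises `ValueError` when called from any thread other than the main thread. A minimal sketch that reproduces this (the function name `install_in_background_thread` is made up for illustration and is not part of vLLM):

```python
import signal
import threading


def install_in_background_thread() -> str:
    """Try to register a SIGTERM handler from a non-main thread."""
    results = []

    def target() -> None:
        try:
            # Only the main thread may install signal handlers.
            signal.signal(signal.SIGTERM, lambda signum, frame: None)
            results.append("installed")
        except ValueError as exc:
            results.append(f"ValueError: {exc}")

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return results[0]
```

Calling `install_in_background_thread()` returns a string starting with `ValueError`, which is why a signal-based design cannot work when the engine is created from a worker thread of a host server.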

DESIGN:

For errors during startup, we wrap the init code with try...except and push FAILED over the ready pipe. This works well since the parent processes are waiting for confirmation.
For errors during runtime, we wrap the busy loops with try...except and push failure messages over the existing IPC mechanisms.
One weakness is that issues with the IPC mechanisms themselves are not handled explicitly. We will explore this more in a follow-up.
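
The two cases in the design can be sketched roughly as follows. This is an illustrative outline, not the code in this PR: `run_worker`, `init_fn`, `step_fn`, the `ENGINE_DEAD` sentinel, and the `"READY"` status value are assumptions, and a plain `queue.Queue` stands in for the real IPC channel. Only the `{"status": "FAILED"}` message is taken from the diff under review.

```python
import queue
from multiprocessing import Pipe

# Hypothetical sentinel pushed to the client when the busy loop dies.
ENGINE_DEAD = "ENGINE_DEAD"


def run_worker(ready_writer, output_queue, init_fn, step_fn) -> None:
    # Startup: the parent is blocked on the ready pipe, so report the outcome.
    try:
        init_fn()
    except Exception:
        ready_writer.send({"status": "FAILED"})
        raise
    else:
        ready_writer.send({"status": "READY"})  # assumed success value
    finally:
        ready_writer.close()

    # Runtime: reuse the existing output channel to report a failure.
    try:
        while True:
            output = step_fn()
            if output is None:  # assumed "no more work" signal for the sketch
                break
            output_queue.put(output)
    except Exception:
        output_queue.put(ENGINE_DEAD)
        raise
```

In both cases the exception is re-raised after the message is sent, so the worker still exits with an error while the parent gets a chance to shut down cleanly.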

TEST MATRIX:

AsyncLLM, TP=1, TP>1 --- runtime and startup
LLM (MP), TP=1, TP>1 --- runtime and startup
LLM (no-MP), TP=1, TP>1 --- runtime and startup

Fixes: #12690

Signed-off-by: [email protected] <[email protected]>


njhill

I'd like to spend more time on this but left comments from a first pass.

njhill · 2025-03-02T17:06:44Z

vllm/v1/engine/async_llm.py

+            await self.abort(request_id)
+            if self.log_requests:
+                logger.info("Request %s failed.", request_id)
+            logger.exception("GOT EXCEPTION:", exc_info=e)


Is this line meant to be here? We should probably change the message if so, and there's no need to set exc_info=e since it will use that by default in the except block.
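
To illustrate the point about `exc_info`: inside an `except` block, `logger.exception` attaches the active exception's traceback on its own. A small self-contained check (the logger name and the helper `capture_exception_log` are made up for this example):

```python
import io
import logging


def capture_exception_log() -> str:
    """Log from inside an except block and return what was written."""
    stream = io.StringIO()
    logger = logging.getLogger("shutdown_demo")
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.propagate = False
    try:
        try:
            raise RuntimeError("engine died")
        except Exception:
            # No exc_info argument: the current exception is used by default.
            logger.exception("Request %s failed.", "req-1")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

The captured text contains the message followed by the full traceback, so passing `exc_info=e` adds nothing here.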

njhill · 2025-03-02T17:07:14Z

vllm/v1/engine/async_llm.py

+        except Exception as e:
+            await self.abort(request_id)
+            if self.log_requests:
+                logger.info("Request %s failed.", request_id)


Maybe add the exception message to the log line too? (but still single line)
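
One possible way to do this while keeping the record on a single line is to format the exception with `%r`, which escapes any newlines in its message. A sketch, assuming a throwaway logger and a hypothetical helper name `log_failure_single_line`:

```python
import io
import logging


def log_failure_single_line(request_id: str, exc: Exception) -> str:
    """Emit one INFO line that includes the exception, and return it."""
    stream = io.StringIO()
    logger = logging.getLogger("single_line_demo")
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    try:
        # %r renders the exception on one line even if its message has newlines.
        logger.info("Request %s failed: %r", request_id, exc)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

Whether `%r` or `%s` reads better in the server logs is a style choice; `%s` is shorter but lets multi-line exception messages spill across lines.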

njhill · 2025-03-02T17:09:15Z

vllm/v1/engine/async_llm.py

+        if self.output_handler is None:
+            return True
+
+        return not self.output_handler.done()


Suggested change

-        if self.output_handler is None:
-            return True
-
-        return not self.output_handler.done()
+        return self.output_handler is None or not self.output_handler.done()

njhill · 2025-03-02T17:10:57Z

vllm/v1/engine/async_llm.py

    @property
    def errored(self) -> bool:
-        return False
+        return (self.engine_core.is_engine_dead or not self.is_running)


nit:

Suggested change

-        return (self.engine_core.is_engine_dead or not self.is_running)
+        return self.engine_core.is_engine_dead or not self.is_running

njhill · 2025-03-02T17:15:33Z

vllm/v1/engine/core.py

+        except Exception as e:
+            logger.exception("EngineCore got an Exception during startup:",
+                             exc_info=e)
+            ready_pipe.send({"status": "FAILED"})
+            raise e
+
+        finally:
+            ready_pipe.close()


I'm not sure that this should go here. Probably better to catch/handle in the calling method, including handling sending the pipe ready message since it "owns" the pipe. In the failure case, I'm not sure that we need to send failed status back since the process could die for other reasons without us being able to do so, and we need to handle that equivalently from the client side.
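
For the client side of this concern, a rough sketch of a wait that treats "child died without reporting" the same as an explicit failure. The function name, the timeout, and the `"READY"` value are assumptions for illustration; the real client code may look quite different.

```python
from multiprocessing import Pipe
from multiprocessing.connection import Connection


def wait_for_ready(reader: Connection, timeout: float = 5.0) -> None:
    """Return once the child reports READY; raise on any other outcome."""
    try:
        if not reader.poll(timeout):
            raise RuntimeError("timed out waiting for startup status")
        message = reader.recv()
    except (EOFError, OSError) as exc:
        # The write end was closed without a message, e.g. the child crashed.
        raise RuntimeError("child exited before reporting status") from exc
    finally:
        reader.close()
    if message.get("status") != "READY":
        raise RuntimeError(f"child reported startup failure: {message}")
```

Note that the end-of-file case is only observable if the parent closes its own copy of the write end after starting the child; otherwise the pipe stays open and only the timeout fires.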

njhill · 2025-03-02T17:34:19Z

vllm/v1/executor/multiproc_executor.py

+    @staticmethod
+    def wait_for_ready(
+            unready_proc_handle: UnreadyWorkerProcHandle) -> WorkerProcHandle:


Why not make this a method on UnreadyWorkerProcHandle?

rafvasq · 2025-03-25T13:52:27Z

@robertgshaw2-redhat, can I help out with a refactor of this one again?

mergify · 2025-03-27T05:51:30Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rafvasq.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

rafvasq · 2025-03-27T14:44:40Z

Back to #11737

[email protected] added 30 commits January 3, 2025 23:11

checkpoint prototype

eb16239

Signed-off-by: [email protected] <[email protected]>

Issue currently is with streaming. The HTTP exception handlers do not…

8549fdd

… handle properly Signed-off-by: [email protected] <[email protected]>

switch from ValueError -> Exception.

77801cd

merged

1bbc3a4

updated

8eca864

stash

b8c77b3

stash

ce9b8ef

add watchdog

3a760a7

updated

3024da0

revert spurious changes

5af8189

updated

3cb21bb

updated

7c97308

updated

ea6824a

remove cruft

b278065

cruft

c004bd4

stash

2556bc4

fix llama

db0b9e6

updated

f722589

cruft

de75cc4

cruft

ba5ca87

updated

4f6b68a

updated

949d425

updated

f67398b

updated

b3d2994

update comment

34a997a

update comment

32cf91b

fix more

c73801c

updated

1188845

udpatd

706782c

added exception file

1cc0915

[email protected] added 5 commits March 1, 2025 18:13

updated

6e823ad

Signed-off-by: [email protected] <[email protected]>

updated

7e3ffe8

Signed-off-by: [email protected] <[email protected]>

updatd

f405db8

Signed-off-by: [email protected] <[email protected]>

updated

867ff8f

Signed-off-by: [email protected] <[email protected]>

update to ensure we call shutdown on RPC error

a8403ac

Signed-off-by: [email protected] <[email protected]>

robertgshaw2-redhat added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 1, 2025

[email protected] added 12 commits March 1, 2025 20:36

fixed

18a4536

Signed-off-by: [email protected] <[email protected]>

updated

ed2759b

Signed-off-by: [email protected] <[email protected]>

updated

113255e

Signed-off-by: [email protected] <[email protected]>

updated

f79d23f

Signed-off-by: [email protected] <[email protected]>

updated

a09ef27

Signed-off-by: [email protected] <[email protected]>

updated

9b11b6c

Signed-off-by: [email protected] <[email protected]>

updated

1f7ed2e

Signed-off-by: [email protected] <[email protected]>

updated

fba8e41

Signed-off-by: [email protected] <[email protected]>

removed mp=0 tests

9a3c861

Signed-off-by: [email protected] <[email protected]>

removed non mp tests

38857d8

Signed-off-by: [email protected] <[email protected]>

removed non mp tests

ac06927

Signed-off-by: [email protected] <[email protected]>

fixed error

b002dcf

Signed-off-by: [email protected] <[email protected]>

njhill reviewed Mar 2, 2025


joerunde mentioned this pull request Mar 3, 2025

[V1][Core] Support for Structured Outputs #12388

Merged

njhill mentioned this pull request Mar 7, 2025

[MISC][V1] Register process killing handler only in the main thread #14380

Merged

joerunde mentioned this pull request Mar 10, 2025

[core][V1] pluggable scheduler #14466

Merged

joerunde mentioned this pull request Mar 19, 2025

✨ Reject requests, and upgrade vllm install to 0.8.0 vllm-project/vllm-spyre#37

Merged

rafvasq mentioned this pull request Mar 26, 2025

[V1][Frontend] Improve Shutdown And Logs #11737

Merged

mergify bot added the tpu label (Related to Google TPUs) Mar 27, 2025

mergify bot added the needs-rebase label Mar 27, 2025

rafvasq closed this Mar 27, 2025
