Skip to content

Conversation

rafvasq
Copy link
Contributor

@rafvasq rafvasq commented Feb 28, 2025

In place of #11737

authored by: @robertgshaw2-redhat

SUMMARY:

  • Prior to this PR, if we encountered an error in a background process, we kill the process tree immediately, which means that we cannot cleanup resources and cannot return good status codes to clients. This PR overhauls the Error handling to instead shut down the background processes and raise Errors that allow us to return proper HTTP status codes to users
  • Prior to this PR, we were not properly shutting down when Errors occured during startup, especially in the TP case
  • Prior to this PR, we used signals to catch errors from background processes. Due to limitations of Python, this prevented us from running outside the main thread. This is a problem for deployments in TritonServer

DESIGN:

  • for errors during startup, we wrap init code with try...catch and push FAILED over the ready PIPE. This works well since the parent processes are waiting for confirmation

  • for errors during runtime, we wrap the busy loops with try..catch and push failure messages over the existing IPC mechanisms.

  • One weakness is that issues with the ipc mechanisms themselves are not handled explicitly. We will explore this more in a follow up

TEST MATRIX:

  • AsyncLLM, TP=1 + TP>1 --- runtime and startup
  • LLM (MP), TP=1, TP>1 --- runtime and startup
  • LLM (no-MP), TP=1, TP>1 --- runtime and startup

Fixes: #12690

@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 1, 2025
Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to spend more time on this but left comments from a first pass.

await self.abort(request_id)
if self.log_requests:
logger.info("Request %s failed.", request_id)
logger.exception("GOT EXCEPTION:", exc_info=e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line meant to be here? We should probably change the message if so, and there's no need to set exc_info=e since it will use that by default in the except block.

except Exception as e:
await self.abort(request_id)
if self.log_requests:
logger.info("Request %s failed.", request_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the exception message to the log line too? (but still single line)

Comment on lines +436 to +439
if self.output_handler is None:
return True

return not self.output_handler.done()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if self.output_handler is None:
return True
return not self.output_handler.done()
return self.output_handler is None or not self.output_handler.done()

@property
def errored(self) -> bool:
return False
return (self.engine_core.is_engine_dead or not self.is_running)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
return (self.engine_core.is_engine_dead or not self.is_running)
return self.engine_core.is_engine_dead or not self.is_running

Comment on lines +275 to +282
except Exception as e:
logger.exception("EngineCore got an Exception during startup:",
exc_info=e)
ready_pipe.send({"status": "FAILED"})
raise e

finally:
ready_pipe.close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this should go here. Probably better to catch/handle in the calling method, including handling sending the pipe ready message since it "owns" the pipe. In the failure case, I'm not sure that we need to send failed status back since the process could die for other reasons without us being able to do so, and we need to handle that equivalently from the client side.

Comment on lines +299 to +301
@staticmethod
def wait_for_ready(
unready_proc_handle: UnreadyWorkerProcHandle) -> WorkerProcHandle:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not a method on UnreadyProcHandle?

@rafvasq
Copy link
Contributor Author

rafvasq commented Mar 25, 2025

@robertgshaw2-redhat, can I help out with a refactor of this one again?

@mergify mergify bot added the tpu Related to Google TPUs label Mar 27, 2025
Copy link

mergify bot commented Mar 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rafvasq.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2025
@rafvasq
Copy link
Contributor Author

rafvasq commented Mar 27, 2025

Back to #11737

@rafvasq rafvasq closed this Mar 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/build frontend needs-rebase ready ONLY add when PR is ready to merge/full CI is needed tpu Related to Google TPUs v1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: V1 cannot be run in Triton Inference Server Backend
3 participants