Skip to content

Conversation

david6666666
Copy link
Contributor

@david6666666 david6666666 commented Jul 23, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

fix #21432 ,After disagg_example_p2p_nccl_xpyd.sh cleanup, there is a zombie process disagg_proxy_p2p_nccl_xpyd.py.
In order to become an independent daemon process, the disagg_proxy_p2p_nccl_xpyd.py script performs an operation similar to os.setsid() in the code, which creates a new process group. As a result, it no longer belongs to the process group of the original script, so the kill -- -$ command cannot affect it, causing it to remain after the script ends.

Test Plan

bash vllm/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh

Test Result

running:

 ps -ef 
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 01:00 pts/0    00:00:00 /bin/bash
root        156      0  0 01:01 pts/1    00:00:00 bash
root      87417      0  0 06:39 pts/2    00:00:00 bash
root      94300    156  0 07:05 pts/1    00:00:00 bash disagg_example_p2p_nccl_xpyd.sh
root      94515  94300  1 07:05 pts/1    00:00:00 python3 disagg_proxy_p2p_nccl_xpyd.py
root      94516  94300 85 07:05 pts/1    00:00:18 /usr/bin/python /usr/local/bin/vllm serve /workspace/mo
root      94517  94300 89 07:05 pts/1    00:00:19 /usr/bin/python /usr/local/bin/vllm serve /workspace/mo
root      94518  94300 88 07:05 pts/1    00:00:18 /usr/bin/python /usr/local/bin/vllm serve /workspace/mo
root      94519  94300 88 07:05 pts/1    00:00:19 /usr/bin/python /usr/local/bin/vllm serve /workspace/mo
root      95654  94518  1 07:05 pts/1    00:00:00 /usr/bin/python -c from multiprocessing.resource_tracke
root      95655  94518 99 07:05 pts/1    00:00:10 /usr/bin/python -c from multiprocessing.spawn import sp
root      95755  94517  2 07:05 pts/1    00:00:00 /usr/bin/python -c from multiprocessing.resource_tracke
root      95758  94517 99 07:05 pts/1    00:00:05 /usr/bin/python -c from multiprocessing.spawn import sp
root      95759  94519  3 07:05 pts/1    00:00:00 /usr/bin/python -c from multiprocessing.resource_tracke
root      95760  94519 99 07:05 pts/1    00:00:05 /usr/bin/python -c from multiprocessing.spawn import sp
root      95763  94516  3 07:05 pts/1    00:00:00 /usr/bin/python -c from multiprocessing.resource_tracke
root      95764  94516 99 07:05 pts/1    00:00:05 /usr/bin/python -c from multiprocessing.spawn import sp
root      95959  94300  0 07:05 pts/1    00:00:00 sleep 1
root      95960  87417  0 07:05 pts/2    00:00:00 ps -ef

result:

Warning: P2P NCCL disaggregated prefill XpYd support for vLLM v1 is experimental and subject to change.

Architecture Configuration:
  Model: /workspace/models/Qwen3-30B-A3B-FP8
  Prefill GPUs: 0, Ports: 20003
  Decode GPUs: 1,2,3, Ports: 20005,20007,20009
  Proxy Port: 30001
  Timeout: 1200s

Found 8 GPUs.
Checking if pandas is installed...
pandas is installed.
Checking if datasets is installed...
datasets is installed.
Checking if vllm is installed...
vllm is installed.
Checking if quart is installed...
quart is installed.
Launching disaggregated serving components...
Please check the log files for detailed output:
  - prefill*.log: Prefill server logs
  - decode*.log: Decode server logs
  - proxy.log: Proxy server log

Starting proxy server on port 30001...

Starting 1 prefill server(s)...
  Prefill server 1: GPU 0, Port 20003, KV Port 21001

Starting 3 decode server(s)...
  Decode server 1: GPU 1, Port 20005, KV Port 22001
  Decode server 2: GPU 2, Port 20007, KV Port 22002
  Decode server 3: GPU 3, Port 20009, KV Port 22003

Waiting for all servers to start...
Waiting for server on port 20003...
 * Serving Quart app 'disagg_proxy_p2p_nccl_xpyd'
 * Debug mode: False
 * Please use an ASGI server (e.g. Hypercorn) directly in production
 * Running on http://0.0.0.0:10001 (CTRL + C to quit)
[2025-07-23 07:05:12 +0000] [94515] [INFO] Running on http://0.0.0.0:10001 (CTRL + C to quit)
Add [HTTP:10.90.67.84:20007, ZMQ:10.90.67.84:22002]
Add [HTTP:10.90.67.84:20009, ZMQ:10.90.67.84:22003]
Add [HTTP:10.90.67.84:20005, ZMQ:10.90.67.84:22001]
Add [HTTP:10.90.67.84:20003, ZMQ:10.90.67.84:21001]
Server on port 20003 is ready.
Waiting for server on port 20005...
Server on port 20005 is ready.
Waiting for server on port 20007...
Server on port 20007 is ready.
Waiting for server on port 20009...
Server on port 20009 is ready.

All servers are up. Starting benchmark...
handle_request count: 0, [HTTP:10.90.67.84:20003, ZMQ:10.90.67.84:21001]  [HTTP:10.90.67.84:20009, ZMQ:10.90.67.84:22003]
[2025-07-23 07:06:22 +0000] [94515] [INFO] 127.0.0.1:37104 POST /v1/completions 1.1 200 - 4435184
{"id":"cmpl-___prefill_addr_10.90.67.84:21001___decode_addr_10.90.67.84:22003_c033cd2e519e4f0481650baa4f487532","object":"text_completion","created":1753254380,"model":"/workspace/models/Qwen3-30B-A3B-FP8","choices":[{"index":0,"text":" Also, what is the difference between the special and general theories of relativity? Additionally, can you explain the concept of time dilation and how it relates to the theory of relativity? Finally, how does the theory of relativity affect our understanding of","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":11,"total_tokens":61,"completion_tokens":50,"prompt_tokens_details":null},"kv_transfer_params":null}Benchmarking done. Cleaning up...
Stopping everything…
Terminated

after cleanup:

ps -ef 
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 01:00 pts/0    00:00:00 /bin/bash
root        156      0  0 01:01 pts/1    00:00:00 bash
root      87417      0  0 06:39 pts/2    00:00:00 bash
root      97373  87417  0 07:06 pts/2    00:00:00 ps -ef

(Optional) Documentation Update

@mergify mergify bot added the documentation Improvements or additions to documentation label Jul 23, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly identifies and fixes an issue with a zombie process by explicitly terminating it during cleanup. My feedback suggests a minor but important improvement: using a more graceful termination signal (SIGTERM instead of SIGKILL) to ensure the proxy server can clean up its resources properly, which enhances the overall robustness of the example script.

@@ -93,6 +93,7 @@ ensure_python_library_installed() {
cleanup() {
echo "Stopping everything…"
trap - INT TERM # prevent re-entrancy
pkill -9 -f "disagg_proxy_p2p_nccl_xpyd.py"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using SIGKILL (-9) should be a last resort, as it terminates the process immediately without allowing it to perform any cleanup operations. The disagg_proxy_p2p_nccl_xpyd.py script runs a Quart server, which can handle a graceful shutdown on SIGTERM (the default signal for pkill) to properly close network sockets and other resources.

Forcibly killing the server with SIGKILL might leave resources in an inconsistent state, potentially causing issues like "Address already in use" errors if you run the example again quickly.

I recommend removing the -9 flag to allow the process to shut down gracefully.

Suggested change
pkill -9 -f "disagg_proxy_p2p_nccl_xpyd.py"
pkill -f "disagg_proxy_p2p_nccl_xpyd.py"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@david6666666
Copy link
Contributor Author

@DarkLight1337 @KuntaiDu please review,thanks

Copy link
Collaborator

@KuntaiDu KuntaiDu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -93,6 +93,7 @@ ensure_python_library_installed() {
cleanup() {
echo "Stopping everything…"
trap - INT TERM # prevent re-entrancy
pkill -9 -f "disagg_proxy_p2p_nccl_xpyd.py"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@KuntaiDu KuntaiDu enabled auto-merge (squash) July 23, 2025 16:27
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 23, 2025
@vllm-bot vllm-bot merged commit f8c15c4 into vllm-project:main Jul 24, 2025
54 of 55 checks passed
DW934 pushed a commit to DW934/vllm that referenced this pull request Jul 24, 2025
DW934 pushed a commit to DW934/vllm that referenced this pull request Jul 28, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
wenscarl pushed a commit to wenscarl/vllm that referenced this pull request Aug 4, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
taneem-ibrahim pushed a commit to taneem-ibrahim/vllm that referenced this pull request Aug 14, 2025
BoyuanFeng pushed a commit to BoyuanFeng/vllm that referenced this pull request Aug 14, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: After online_serving disagg_example_p2p_nccl_xpyd.sh cleanup, there is a zombie process
3 participants