[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process #21437

david6666666 · 2025-07-23T07:15:38Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

fix #21432 ，After disagg_example_p2p_nccl_xpyd.sh cleanup, there is a zombie process disagg_proxy_p2p_nccl_xpyd.py.
In order to become an independent daemon process, the disagg_proxy_p2p_nccl_xpyd.py script performs an operation similar to os.setsid() in the code, which creates a new process group. As a result, it no longer belongs to the process group of the original script, so the kill -- -$ command cannot affect it, causing it to remain after the script ends.

Test Plan

bash vllm/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh

Test Result

running:

 ps -ef 
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 01:00 pts/0    00:00:00 /bin/bash
root        156      0  0 01:01 pts/1    00:00:00 bash
root      87417      0  0 06:39 pts/2    00:00:00 bash
root      94300    156  0 07:05 pts/1    00:00:00 bash disagg_example_p2p_nccl_xpyd.sh
root      94515  94300  1 07:05 pts/1    00:00:00 python3 disagg_proxy_p2p_nccl_xpyd.py
root      94516  94300 85 07:05 pts/1    00:00:18 /usr/bin/python /usr/local/bin/vllm serve /workspace/mo
root      94517  94300 89 07:05 pts/1    00:00:19 /usr/bin/python /usr/local/bin/vllm serve /workspace/mo
root      94518  94300 88 07:05 pts/1    00:00:18 /usr/bin/python /usr/local/bin/vllm serve /workspace/mo
root      94519  94300 88 07:05 pts/1    00:00:19 /usr/bin/python /usr/local/bin/vllm serve /workspace/mo
root      95654  94518  1 07:05 pts/1    00:00:00 /usr/bin/python -c from multiprocessing.resource_tracke
root      95655  94518 99 07:05 pts/1    00:00:10 /usr/bin/python -c from multiprocessing.spawn import sp
root      95755  94517  2 07:05 pts/1    00:00:00 /usr/bin/python -c from multiprocessing.resource_tracke
root      95758  94517 99 07:05 pts/1    00:00:05 /usr/bin/python -c from multiprocessing.spawn import sp
root      95759  94519  3 07:05 pts/1    00:00:00 /usr/bin/python -c from multiprocessing.resource_tracke
root      95760  94519 99 07:05 pts/1    00:00:05 /usr/bin/python -c from multiprocessing.spawn import sp
root      95763  94516  3 07:05 pts/1    00:00:00 /usr/bin/python -c from multiprocessing.resource_tracke
root      95764  94516 99 07:05 pts/1    00:00:05 /usr/bin/python -c from multiprocessing.spawn import sp
root      95959  94300  0 07:05 pts/1    00:00:00 sleep 1
root      95960  87417  0 07:05 pts/2    00:00:00 ps -ef

result:

Warning: P2P NCCL disaggregated prefill XpYd support for vLLM v1 is experimental and subject to change.

Architecture Configuration:
  Model: /workspace/models/Qwen3-30B-A3B-FP8
  Prefill GPUs: 0, Ports: 20003
  Decode GPUs: 1,2,3, Ports: 20005,20007,20009
  Proxy Port: 30001
  Timeout: 1200s

Found 8 GPUs.
Checking if pandas is installed...
pandas is installed.
Checking if datasets is installed...
datasets is installed.
Checking if vllm is installed...
vllm is installed.
Checking if quart is installed...
quart is installed.
Launching disaggregated serving components...
Please check the log files for detailed output:
  - prefill*.log: Prefill server logs
  - decode*.log: Decode server logs
  - proxy.log: Proxy server log

Starting proxy server on port 30001...

Starting 1 prefill server(s)...
  Prefill server 1: GPU 0, Port 20003, KV Port 21001

Starting 3 decode server(s)...
  Decode server 1: GPU 1, Port 20005, KV Port 22001
  Decode server 2: GPU 2, Port 20007, KV Port 22002
  Decode server 3: GPU 3, Port 20009, KV Port 22003

Waiting for all servers to start...
Waiting for server on port 20003...
 * Serving Quart app 'disagg_proxy_p2p_nccl_xpyd'
 * Debug mode: False
 * Please use an ASGI server (e.g. Hypercorn) directly in production
 * Running on http://0.0.0.0:10001 (CTRL + C to quit)
[2025-07-23 07:05:12 +0000] [94515] [INFO] Running on http://0.0.0.0:10001 (CTRL + C to quit)
Add [HTTP:10.90.67.84:20007, ZMQ:10.90.67.84:22002]
Add [HTTP:10.90.67.84:20009, ZMQ:10.90.67.84:22003]
Add [HTTP:10.90.67.84:20005, ZMQ:10.90.67.84:22001]
Add [HTTP:10.90.67.84:20003, ZMQ:10.90.67.84:21001]
Server on port 20003 is ready.
Waiting for server on port 20005...
Server on port 20005 is ready.
Waiting for server on port 20007...
Server on port 20007 is ready.
Waiting for server on port 20009...
Server on port 20009 is ready.

All servers are up. Starting benchmark...
handle_request count: 0, [HTTP:10.90.67.84:20003, ZMQ:10.90.67.84:21001]  [HTTP:10.90.67.84:20009, ZMQ:10.90.67.84:22003]
[2025-07-23 07:06:22 +0000] [94515] [INFO] 127.0.0.1:37104 POST /v1/completions 1.1 200 - 4435184
{"id":"cmpl-___prefill_addr_10.90.67.84:21001___decode_addr_10.90.67.84:22003_c033cd2e519e4f0481650baa4f487532","object":"text_completion","created":1753254380,"model":"/workspace/models/Qwen3-30B-A3B-FP8","choices":[{"index":0,"text":" Also, what is the difference between the special and general theories of relativity? Additionally, can you explain the concept of time dilation and how it relates to the theory of relativity? Finally, how does the theory of relativity affect our understanding of","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":11,"total_tokens":61,"completion_tokens":50,"prompt_tokens_details":null},"kv_transfer_params":null}Benchmarking done. Cleaning up...
Stopping everything…
Terminated

after cleanup:

ps -ef 
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 01:00 pts/0    00:00:00 /bin/bash
root        156      0  0 01:01 pts/1    00:00:00 bash
root      87417      0  0 06:39 pts/2    00:00:00 bash
root      97373  87417  0 07:06 pts/2    00:00:00 ps -ef

(Optional) Documentation Update

Signed-off-by: David Chen <[email protected]>

gemini-code-assist

Code Review

This pull request correctly identifies and fixes an issue with a zombie process by explicitly terminating it during cleanup. My feedback suggests a minor but important improvement: using a more graceful termination signal (SIGTERM instead of SIGKILL) to ensure the proxy server can clean up its resources properly, which enhances the overall robustness of the example script.

gemini-code-assist · 2025-07-23T07:16:46Z

examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh

@@ -93,6 +93,7 @@ ensure_python_library_installed() {
 cleanup() {
    echo "Stopping everything…"
    trap - INT TERM        # prevent re-entrancy
+    pkill -9 -f "disagg_proxy_p2p_nccl_xpyd.py"


Using SIGKILL (-9) should be a last resort, as it terminates the process immediately without allowing it to perform any cleanup operations. The disagg_proxy_p2p_nccl_xpyd.py script runs a Quart server, which can handle a graceful shutdown on SIGTERM (the default signal for pkill) to properly close network sockets and other resources.

Forcibly killing the server with SIGKILL might leave resources in an inconsistent state, potentially causing issues like "Address already in use" errors if you run the example again quickly.

I recommend removing the -9 flag to allow the process to shut down gracefully.

Suggested change

pkill -9 -f "disagg_proxy_p2p_nccl_xpyd.py"

pkill -f "disagg_proxy_p2p_nccl_xpyd.py"

github-actions · 2025-07-23T07:47:31Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

david6666666 · 2025-07-23T10:33:40Z

@DarkLight1337 @KuntaiDu please review,thanks

KuntaiDu

LGTM!

KuntaiDu · 2025-07-23T16:27:02Z

examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh

@@ -93,6 +93,7 @@ ensure_python_library_installed() {
 cleanup() {
    echo "Stopping everything…"
    trap - INT TERM        # prevent re-entrancy
+    pkill -9 -f "disagg_proxy_p2p_nccl_xpyd.py"


…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: 董巍 <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: avigny <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: shuw <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: x22x22 <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: Paul Pak <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: Boyuan Feng <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: Diego-Castan <[email protected]>

…llm-project#21437) Signed-off-by: David Chen <[email protected]>

fix p2p_nccl_xpyd example zombie process

0cdf5d0

Signed-off-by: David Chen <[email protected]>

mergify bot added the documentation Improvements or additions to documentation label Jul 23, 2025

gemini-code-assist bot reviewed Jul 23, 2025

View reviewed changes

DarkLight1337 requested a review from robertgshaw2-redhat July 23, 2025 10:38

KuntaiDu approved these changes Jul 23, 2025

View reviewed changes

KuntaiDu enabled auto-merge (squash) July 23, 2025 16:27

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 23, 2025

vllm-bot merged commit f8c15c4 into vllm-project:main Jul 24, 2025
54 of 55 checks passed

DW934 pushed a commit to DW934/vllm that referenced this pull request Jul 24, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

1575894

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: 董巍 <[email protected]>

DW934 pushed a commit to DW934/vllm that referenced this pull request Jul 28, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

ffd2fe2

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: 董巍 <[email protected]>

avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

7a08a73

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: avigny <[email protected]>

wenscarl pushed a commit to wenscarl/vllm that referenced this pull request Aug 4, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

5cd6eaf

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: shuw <[email protected]>

x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

07dc5f0

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: x22x22 <[email protected]>

Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

27a05e7

…llm-project#21437) Signed-off-by: David Chen <[email protected]>

npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

dc7755b

…llm-project#21437) Signed-off-by: David Chen <[email protected]>

jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

b698b3b

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>

paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

66cae7e

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: Paul Pak <[email protected]>

taneem-ibrahim pushed a commit to taneem-ibrahim/vllm that referenced this pull request Aug 14, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

7a598ff

…llm-project#21437) Signed-off-by: David Chen <[email protected]>

BoyuanFeng pushed a commit to BoyuanFeng/vllm that referenced this pull request Aug 14, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

e076e10

…llm-project#21437) Signed-off-by: David Chen <[email protected]> Signed-off-by: Boyuan Feng <[email protected]>

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

b752538

…llm-project#21437) Signed-off-by: David Chen <[email protected]>

googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (v…

fdb1581

…llm-project#21437) Signed-off-by: David Chen <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process #21437

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process #21437

Uh oh!

david6666666 commented Jul 23, 2025 •

edited by github-actions bot

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jul 23, 2025

Uh oh!

KuntaiDu Jul 23, 2025

Uh oh!

github-actions bot commented Jul 23, 2025

Uh oh!

david6666666 commented Jul 23, 2025

Uh oh!

KuntaiDu left a comment

Uh oh!

KuntaiDu Jul 23, 2025

Uh oh!

Uh oh!

Uh oh!

	pkill -9 -f "disagg_proxy_p2p_nccl_xpyd.py"
	pkill -f "disagg_proxy_p2p_nccl_xpyd.py"

Uh oh!

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process #21437

[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process #21437

Uh oh!

Conversation

david6666666 commented Jul 23, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Essential Elements of an Effective PR Description Checklist

Purpose

Test Plan

Test Result

(Optional) Documentation Update

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jul 23, 2025

Choose a reason for hiding this comment

Uh oh!

KuntaiDu Jul 23, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Jul 23, 2025

Uh oh!

david6666666 commented Jul 23, 2025

Uh oh!

KuntaiDu left a comment

Choose a reason for hiding this comment

Uh oh!

KuntaiDu Jul 23, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

david6666666 commented Jul 23, 2025 •

edited by github-actions bot

Loading