Description
Describe the Bug
On an image built from the latest main branch, the error "failed to join writer task: I/O error: Broken pipe" is logged immediately after every "Health check successful for generate". Could you check whether this is related to the commit "fix: Fix warning messages with healthcheck (#4793)", or whether there is another possible cause?
I do not recall seeing this issue before 12/8.
2025-12-09T14:36:33.805975Z INFO dynamo_runtime::health_check: Spawned health check task for endpoint: generate
2025-12-09T14:36:33.805985Z INFO dynamo_runtime::health_check: Health check task started for: generate
2025-12-09T14:36:33.805992Z INFO dynamo_runtime::component::endpoint: Registering endpoint with request plane server endpoint=generate transport="nats"
2025-12-09T14:36:33.806021Z INFO dynamo_runtime::pipeline::network::ingress::nats_server: NatsMultiplexedServer::register_endpoint called endpoint_name=generate namespace=vllm-agg component=backend instance_id=7042392880458507523
2025-12-09T14:36:33.806033Z INFO dynamo_runtime::pipeline::network::ingress::nats_server: Successfully retrieved service group
2025-12-09T14:36:33.806048Z INFO dynamo_runtime::pipeline::network::ingress::nats_server: Registering NATS endpoint endpoint_name=generate endpoint_with_id=generate-61bb9b0338782503 namespace=vllm-agg component=backend instance_id=7042392880458507523
2025-12-09T14:36:33.806053Z INFO dynamo_runtime::pipeline::network::ingress::nats_server: Starting NATS push endpoint listener (blocking) endpoint_name=generate endpoint_with_id=generate-61bb9b0338782503
2025-12-09T14:36:43.807605Z INFO dynamo_runtime::health_check: Canary timer expired for generate, sending health check
2025-12-09T14:36:43.808823Z ERROR dynamo_runtime::health_check: Health check request failed for generate: instance_id=7042392880458507523 not found for endpoint vllm-agg/backend/generate
2025-12-09T14:36:46.430145Z ERROR http-request: tower_http::trace::on_failure: response failed classification=Status code: 503 Service Unavailable latency=0 ms method=GET uri=/live version=HTTP/1.1
2025-12-09T14:36:51.431332Z ERROR http-request: tower_http::trace::on_failure: response failed classification=Status code: 503 Service Unavailable latency=0 ms method=GET uri=/health version=HTTP/1.1
2025-12-09T14:36:51.693577Z WARN logger._print_warning_once: cudagraph dispatching keys are not initialized. No cudagraph will be used.
2025-12-09T14:36:53.809261Z INFO dynamo_runtime::health_check: Canary timer expired for generate, sending health check
2025-12-09T14:36:53.867328Z INFO dynamo_runtime::health_check: Health check successful for generate
2025-12-09T14:36:53.867586Z ERROR dynamo_runtime::pipeline::network::tcp::client: failed to join writer task: I/O error: Broken pipe (os error 32)
Caused by:
Broken pipe (os error 32)
2025-12-09T14:36:53.868410Z ERROR handle_payload: dynamo_runtime::pipeline::network::ingress::push_handler: Failed to publish response for stream 1b124693-2650-450f-9436-2bdb340d5646
2025-12-09T14:37:03.811901Z INFO dynamo_runtime::health_check: Canary timer expired for generate, sending health check
2025-12-09T14:37:03.859990Z INFO dynamo_runtime::health_check: Health check successful for generate
2025-12-09T14:37:03.860235Z ERROR dynamo_runtime::pipeline::network::tcp::client: failed to join writer task: I/O error: Broken pipe (os error 32)
Caused by:
Broken pipe (os error 32)
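To check how consistently the error follows a successful health check, the log can be scanned for each "Health check successful" line and the first "Broken pipe" error after it. The script below is only a rough sketch for triage, not part of Dynamo: the function names and the 5 ms window are assumptions, and it expects the timestamp and level layout shown in the log above.

```python
import re
from datetime import datetime

# Matches lines such as:
#   2025-12-09T14:36:53.867328Z INFO dynamo_runtime::health_check: ...
# Continuation lines like "Caused by:" have no timestamp and are skipped.
LINE_RE = re.compile(r"^(\S+Z)\s+(INFO|WARN|ERROR)\s+(.*)$")


def parse_ts(ts):
    """Parse the UTC timestamp format used in the log (microsecond precision)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def correlate(lines, window_ms=5.0):
    """Pair each successful health check with a Broken pipe error that follows it.

    Returns a list of (ok_time, err_time, gap_ms) tuples. The window_ms
    threshold is an arbitrary assumption; adjust it as needed.
    """
    pairs = []
    last_ok = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, level, rest = m.groups()
        t = parse_ts(ts)
        if level == "INFO" and "Health check successful" in rest:
            last_ok = t
        elif level == "ERROR" and "Broken pipe" in rest and last_ok is not None:
            gap_ms = (t - last_ok).total_seconds() * 1000
            if gap_ms <= window_ms:
                pairs.append((last_ok, t, gap_ms))
            # Count each health check at most once.
            last_ok = None
    return pairs
```

Passing the worker log as an iterable of lines (for example an open file object) returns one entry per matched pair, so the count can be compared with the number of successful health checks.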
Steps to Reproduce
- Build an image from main for an XPU device
- The issue appears on the decoder worker after it prints "Starting NATS push endpoint listener (blocking) endpoint_name=generate endpoint_with_id=generate-61bb9b0338782503"
Expected Behavior
The decoder worker runs normally.
Actual Behavior
The decoder worker keeps restarting:
dynamo-system vllm-agg-vllmdecodeworker-7784b5dcbb-lpfl2 1/1 Running 7 (4m21s ago) 18m
Environment
main branch + Intel GPU
Additional Context
No response
Screenshots
No response